Breaking the Data Bottleneck

Each day, we come into more contact with artificial intelligence and machine learning that are meant to make our lives better. We’ve all had A.I. experiences that have gone really well. Perhaps we didn’t even realize A.I. was helping us at first. On the other hand, getting help from A.I. doesn’t always work out perfectly, at least not right away. So why the inconsistency? If the human mind can take in so much complex information and make sense of it, why can’t our computers? Or can they, if they have good data to learn from? Brad Porter, CTO of Scale AI, believes the key to A.I. learning efficiently is the right labeling:

“What you need is those samples to be labeled perfectly because if they’re labeled ambiguously, then the model can’t actually decide what exactly is signal versus noise. So one way to solve that is to throw more and more data at it. Eventually you have enough data that the algorithms learn, okay, this is the signal and all these other pieces are the noise. If you get [a] really high quality signal, though, you can learn that signal very quickly if there’s not a lot of noise in it.”

Computers need lots of data to learn. More accurately, they need lots of quality data that is labeled properly. This makes sense: the best way to learn something is through repeated exposure and practice, and that is just as true for computers as it is for people. That’s where Brad comes in. On this episode of IT Visionaries, Brad explains how his diverse work experience, particularly his work in robotics, ultimately led him to focus on solving the problem of data labeling for A.I., which is setting us up for an exciting future. After all, if proper labeling is the key, and that key is becoming more readily available, then we can expect great things in the A.I. space. Brad discusses some of those great things, including how the tech will help us understand medical histories and how it is used in autonomous vehicles. Enjoy the episode!

Main Takeaways

Breaking the Data Bottleneck: There is a lot of data in the world for A.I. to access. The primary challenge in machine learning is getting the computer to distinguish which information matters most so it can learn from it. In this way, people and computers are similar. But computers need our help to know what data is essential.

Labeling Data is Key: It’s easy to get caught up in the glamorous possibilities of A.I. and how it can help us. Computers need data to learn, but they need the right data to learn effectively and efficiently. Labeling data is essential to speeding up machine learning.

What is Signal vs. What is Noise: Proper labeling helps A.I. distinguish signal from noise. A.I. doesn’t necessarily need massive amounts of data to learn if it is given the right, properly labeled data.

Quantity vs. Quality: Without proper labeling, the tendency has been to simply inundate A.I. with data so that learning happens eventually. This is inefficient and costly, and proper labeling streamlines the process. The ideal situation for learning is a tremendous amount of data that is also properly labeled. With large amounts of properly labeled data flowing in automatically, A.I. has a real chance to take off.

For a more in-depth look at this episode, check out the article below.

Article

Each day, we come into more contact with artificial intelligence and machine learning that are meant to make our lives better. We’ve all had A.I. experiences that have gone really well. Perhaps we didn’t even realize A.I. was helping us at first. On the other hand, getting help from A.I. doesn’t always work out perfectly, at least not right away. So why the inconsistency? If the human mind can take in so much complex information and make sense of it, why can’t our computers? Or can they, if they have good data to learn from? Brad Porter, CTO of Scale AI, explained that the key to A.I. learning efficiently is the right labeling:

“What you need is those samples to be labeled perfectly because if they’re labeled ambiguously, then the model can’t actually decide what exactly is signal versus noise,” Porter said. “So one way to solve that is to throw more and more data at it. Eventually you have enough data that the algorithms learn, okay, this is the signal and all these other pieces are the noise. If you get [a] really high quality signal, though, you can learn that signal very quickly if there’s not a lot of noise in it.”

“Really what resonated for me was I felt that the A.I. ecosystem was really focused on model building,” Porter said. “How do you build these models? How do you evaluate model performance? And it felt like it was just missing the mark. Everything I was doing in robotics, the model building was not the hard part. The model building was [if we] tuned some hyper-parameters, we’d solve that in a few days or a couple of weeks and then we were spending months and months or years trying to get the right data [and] trying to get it labeled correctly.”

Traditionally, because quality data has been scarce, the tendency has been to simply inundate A.I. with as much data as possible, regardless of its quality. It is like needing three bags of topsoil for a gardening project and instead dumping three yards of stones and dirt, with some topsoil mixed in, right onto the flower bed. Sure, the required amount of topsoil is in there somewhere, but it’s also a huge mess. Porter described the ideal A.I. learning strategy as providing lots of high-quality data, labeled correctly.

“This is the classic trade-off of how much do you invest in getting more data versus how much do you invest in the quality of the data that you’re labeling and annotating,” Porter said. “In general, the answer is you ideally do both. You try to get a lot of data and you try to label it incredibly well.”
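The trade-off Porter describes can be illustrated with a toy experiment. The sketch below is not from the episode and does not represent Scale AI's tooling. It is a minimal illustration under assumed conditions: a made-up one-dimensional task where the "signal" is whether a number is above 0.5, a hypothetical noise_rate that flips some training labels, and a simple learner that picks whichever threshold best fits the labels it was given. All names and numbers are chosen for illustration only.

```python
import random

TRUE_THRESHOLD = 0.5  # hypothetical "signal": the correct label is True when x > 0.5


def make_dataset(n, noise_rate, rng):
    """Draw n points in [0, 1) and label them, flipping each label with probability noise_rate."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > TRUE_THRESHOLD
        if rng.random() < noise_rate:
            label = not label  # simulate a mislabeled sample
        data.append((x, label))
    return data


def accuracy(threshold, data):
    """Fraction of samples whose label matches the rule x > threshold."""
    correct = sum((x > threshold) == label for x, label in data)
    return correct / len(data)


def fit_threshold(data):
    """Pick the grid threshold (0.00 to 1.00) that best matches the training labels."""
    candidates = [i / 100 for i in range(101)]
    return max(candidates, key=lambda t: accuracy(t, data))


def run_experiment(n_train, noise_rate, seed=0):
    """Train on possibly noisy labels, then score the learned threshold on clean labels."""
    rng = random.Random(seed)
    train = make_dataset(n_train, noise_rate, rng)
    test = make_dataset(2000, 0.0, rng)
    learned = fit_threshold(train)
    return learned, accuracy(learned, test)


if __name__ == "__main__":
    # Small clean set, small noisy set, large noisy set.
    for n_train, noise_rate in [(200, 0.0), (200, 0.3), (5000, 0.3)]:
        learned, acc = run_experiment(n_train, noise_rate)
        print(f"n={n_train:>5} noise={noise_rate:.0%} "
              f"threshold={learned:.2f} test accuracy={acc:.3f}")
```

Running the script prints the learned threshold and the clean-label accuracy for a small clean set, a small noisy set, and a larger noisy set. Changing n_train and noise_rate is a quick way to explore how label quality and data volume interact on this toy task; real models and real labeling pipelines are far more complex.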

Proper data labeling will drive A.I. technology forward by making learning more efficient. Just like humans, computers need high-quality information to learn. They also need our help in identifying the right data to draw upon. If we can provide them with the labels, the learning can really speed up and the possibilities are quite amazing.

“I do think it’s going to continue to accelerate, or the tools will keep getting better and we will start to automate this,” Porter said. “You can see how I’m starting to paint a picture of this kind of automated data flow, where the data starts flowing automatically to improve the models that you have. If we can close those types of loops, it’ll evolve more quickly.”

To hear more about how Porter made a move from robotics into the A.I. space as well as how proper data labeling can lead to tremendous machine learning advances, check out the full episode of IT Visionaries!

