Training Data Sets in Machine Learning Models


By Yuri Brigance

I have a particular interest in how training data sets are created and used in machine learning models.

This year’s TC Robotics & AI conference offered all the proof we need that consumer robotics, powered by the latest Machine Learning science, is quickly becoming a booming industry with plenty of investor interest behind it. New Machine Learning (ML) architectures and training techniques are coming out almost every month. It was interesting to see how these algorithms are being used to create a new wave of consumer tech, as well as the large number of service offerings springing up to make machine learning more user-friendly. How we create the data sets used to train machine learning models is increasingly important.


Training Data Sets

Training data is one of the most noticeable priorities in this new and growing ecosystem.

Machine Learning relies on A LOT of training data, and creating it is no easy feat: much of it requires manual human effort to label correctly. Many companies have sprung up to address this problem and make data collection and labeling faster and easier, in some cases automating it completely.

Aside from labeling, collecting such training data can be just as difficult. Self-driving cars are a well-known example: we’ve all heard of, and maybe even seen, autonomous vehicles being tested on public roads. However, it might come as a surprise that most of those driving miles aren’t used for training data collection.

As Sterling Anderson of Aurora and Raquel Urtasun of Uber explained, most self-driving technologies are actually trained in simulation. The autonomous fleets are out testing the trained models in the real world. On occasion the system will disengage and flag a new scenario. The disengagement condition is then permuted thousands of times and becomes part of the simulation, providing millions of virtual miles for training purposes. It’s cost-efficient, scalable, and very effective.
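To make the idea concrete, here is a minimal sketch of scenario permutation in Python. The `ScenarioParams` structure and its fields are hypothetical stand-ins of my own invention; a real simulator tracks far richer state, but the principle of jittering one recorded disengagement into thousands of variants is the same.

```python
# Minimal sketch of scenario permutation. ScenarioParams is a
# hypothetical stand-in for the much richer state a real simulator tracks.
import random
from dataclasses import dataclass

@dataclass
class ScenarioParams:
    """Parameters recovered from a single real-world disengagement."""
    pedestrian_speed_mps: float   # how fast the pedestrian moved
    time_of_day_hours: float      # lighting conditions
    road_friction: float          # wet vs. dry pavement
    oncoming_gap_s: float         # gap to oncoming traffic

def permute_scenario(base: ScenarioParams, n: int, jitter: float = 0.2):
    """Generate n randomized variants of a recorded disengagement."""
    for _ in range(n):
        yield ScenarioParams(
            pedestrian_speed_mps=base.pedestrian_speed_mps * random.uniform(1 - jitter, 1 + jitter),
            time_of_day_hours=random.uniform(0.0, 24.0),
            road_friction=base.road_friction * random.uniform(1 - jitter, 1 + jitter),
            oncoming_gap_s=base.oncoming_gap_s * random.uniform(1 - jitter, 1 + jitter),
        )

# One flagged disengagement becomes thousands of simulated variants.
base = ScenarioParams(1.4, 18.5, 0.7, 3.2)
variants = list(permute_scenario(base, n=5000))
```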

Creating such simulations is not trivial. In order to provide the right fidelity, not only must the virtual world look visually hyper-realistic, but all the sensor data (lidar, radar, and a hundred others) must also be perfectly synced to the virtual environment. Think flight simulator, but with much better graphics. In many cases, sensor failures can be simulated as well, and self-driving systems need to be able to cope with the sudden loss of input data.
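One common way to build that robustness is to simulate sensor failures as an augmentation step during training. Below is a minimal sketch assuming synced sensor frames stored as a dict of NumPy arrays; the sensor names and array shapes are purely illustrative.

```python
# Minimal sketch of simulating sudden sensor loss during training,
# assuming synced sensor frames stored as a dict of NumPy arrays.
import numpy as np

def drop_sensors(frame: dict, p_fail: float = 0.05, rng=None) -> dict:
    """Randomly blank out whole sensor channels to mimic hardware failure."""
    rng = rng or np.random.default_rng()
    out = {}
    for name, data in frame.items():
        if rng.random() < p_fail:
            out[name] = np.zeros_like(data)  # this sensor went dark mid-drive
        else:
            out[name] = data
    return out

frame = {
    "lidar": np.random.rand(64, 1024),    # synthetic stand-ins for real,
    "radar": np.random.rand(16, 256),     # perfectly synced sensor frames
    "camera": np.random.rand(720, 1280, 3),
}
augmented = drop_sensors(frame, p_fail=0.1)
```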

Visual data is notoriously difficult to label. Simulation aside, imagine you are tasked with outlining all the cars, humans, cats, dogs, lamp posts, trees, road markings, and signs in a single image, and there are tens of thousands of images to go through. This is where companies like SuperAnnotate and ScaleAI come in.

SuperAnnotate provides a tool that combines superpixel-based segmentation with humans in the loop to allow for rapid creation of semantic segmentation masks. Imagine a drone orthomosaic taken over a forest with a variety of tree species — tools like this allow a human to quickly create outlines around the trees belonging to a specific category simply by clicking on them.
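As a rough illustration of the superpixel idea, here is a sketch using scikit-image’s SLIC algorithm as a stand-in (SuperAnnotate’s actual method is its own and more sophisticated). Each click selects an entire superpixel rather than a single pixel, which is what makes this style of labeling so fast.

```python
# Minimal sketch of click-to-label superpixel segmentation, using
# scikit-image's SLIC as a stand-in for SuperAnnotate's approach.
import numpy as np
from skimage.segmentation import slic

def build_superpixels(image: np.ndarray, n_segments: int = 500) -> np.ndarray:
    """Oversegment the image into perceptually uniform regions."""
    return slic(image, n_segments=n_segments, compactness=10, start_label=0)

def label_from_clicks(segments: np.ndarray, clicks: list) -> np.ndarray:
    """Each click selects a whole superpixel, not a single pixel."""
    mask = np.zeros(segments.shape, dtype=bool)
    for row, col in clicks:
        mask |= segments == segments[row, col]
    return mask

# A handful of clicks on tree canopies yields a full segmentation mask.
image = np.random.rand(512, 512, 3)  # stand-in for a drone orthomosaic
segments = build_superpixels(image)
tree_mask = label_from_clicks(segments, clicks=[(100, 120), (300, 310)])
```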

SuperAnnotate’s approach is interesting, but it likely won’t be sufficient for all scenarios. It’s useful in situations where the objects you are trying to segment out have well-defined, contrasting edges, but it would likely not work as well where the separation lines are less defined. A good example is figuring out where the upper lip ends and the upper gum begins in portraits of smiling people. That will likely require a custom labeling tool, something we at Valence have created on a number of occasions.

ScaleAI takes a different approach, and relies on a combination of statistical tools, machine learning checks, and most importantly, humans. This is a very interesting concept — effectively a Mechanical Turk for data labeling.
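A toy version of that concept might look like the following: collect several human labels per item, take a majority vote, and use a model prediction as a sanity check to decide what gets escalated for human review. This is my own simplified sketch, not a description of ScaleAI’s actual pipeline.

```python
# Minimal sketch of aggregating redundant human labels with a model
# check; a simplified illustration, not ScaleAI's actual pipeline.
from collections import Counter

def aggregate(labels_per_item: dict, model_predictions: dict, min_agreement: float = 0.66):
    """Majority-vote each item; flag low-agreement or model-contradicted items."""
    final, needs_review = {}, []
    for item_id, labels in labels_per_item.items():
        winner, votes = Counter(labels).most_common(1)[0]
        agreement = votes / len(labels)
        # Escalate when annotators disagree or the model strongly disagrees.
        if agreement < min_agreement or model_predictions.get(item_id) not in (None, winner):
            needs_review.append(item_id)
        final[item_id] = winner
    return final, needs_review

labels = {"img_001": ["cat", "cat", "dog"], "img_002": ["car", "car", "car"]}
preds = {"img_001": "dog", "img_002": "car"}
final, review = aggregate(labels, preds)  # img_001 gets escalated
```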

So it is quickly becoming apparent that data collection and training are separate pillars of the ML-powered industry in their own right. One might imagine a future where the new “manual labor” is labeling or collecting data. This is a fascinating field to watch, as it provides us with a glimpse of the kinds of new jobs available to folks who are now under threat of unemployment via automation. With one caveat: these systems are distributed, so even if you get a gig as a human data labeler, you may be competing with people from all over the world, which has immediate income implications.

On the other hand, setting up simulations and figuring out ways to collect “difficult” data may be an entire engineering vertical on its own. If you are currently a video game, AR/VR, or general 3D artist/developer, you might find your skills very applicable in the AI/ML world. A friend of mine recently found an app that calculates your Mahjong score from a photo of your tiles. How would you train a model to recognize these tiles from a photo, in various lighting conditions and from all angles? You could painstakingly photograph the tiles and try to label them yourself, or you could hire a 3D artist to model the tiles. Once you have realistic 3D models, you can spin up a number of EC2 instances running Blender (effectively a “render farm” in the cloud). Using Python, you can then programmatically script various scenes (angles, lights, etc.) and use Blender’s ray-tracing engine to crank out thousands of pre-labeled renders of simulated tiles in all sorts of positions, colors, and lighting conditions.
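A minimal sketch of such a Blender script might look like the one below. It assumes the .blend file contains objects named "Tile" and "KeyLight" and that the label for the loaded model is known; all of those names are illustrative, not part of any particular pipeline.

```python
# Minimal sketch of a Blender (bpy) render script. The object names
# "Tile" and "KeyLight" and the label value are illustrative assumptions.
import math
import random
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'        # Blender's ray-tracing engine
tile = bpy.data.objects["Tile"]       # hypothetical tile mesh
light = bpy.data.objects["KeyLight"]  # hypothetical light rig

label = "bamboo_3"  # in practice, derived from the loaded tile model

for i in range(1000):
    # Randomize pose and lighting so the model sees all conditions.
    tile.rotation_euler = (
        math.radians(random.uniform(-30, 30)),
        math.radians(random.uniform(-30, 30)),
        math.radians(random.uniform(0, 360)),
    )
    light.data.energy = random.uniform(100, 1500)  # watts
    # Encode the label in the filename: renders come out pre-labeled.
    scene.render.filepath = f"/renders/{label}_{i:05d}.png"
    bpy.ops.render.render(write_still=True)
```

Run across a fleet of EC2 instances, each rendering a different slice of the parameter space, this is exactly the cloud “render farm” described above.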

But what if your task is to detect weather conditions (wind, rain, hail, thunder, snow) via a small IoT device with just a cheap microphone as its sensor? Where do you get all the training sounds to create your model? Scraping YouTube for audio can only get you so far: those sounds were recorded with different microphones, against different background noise, and in varying conditions. In this case, you may opt to create physical devices designed specifically for this kind of data collection. These may be expensive to build, but they can carry the full set of sensors needed to accurately record and label the sounds you’re looking for, using the same microphones you’ll use in production. Once the data is collected, you can train a model and run inference on a cheap edge device. Coming up with such data collection techniques can be an engineering field of its own, and execution requires manual labor to deploy these devices in the field. It’s an interesting engineering problem, one that will undoubtedly give birth to a number of specialized service and consulting startups.
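Once the field data is in hand, the training side can be fairly modest. Here is a sketch using librosa and scikit-learn, assuming WAV clips organized into folders by label; the paths are illustrative, and the random forest is just one reasonable choice that keeps the resulting model small enough for cheap edge inference.

```python
# Minimal sketch of training a weather-sound classifier on field
# recordings. Folder layout and file paths are illustrative assumptions.
import glob
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def features(path: str) -> np.ndarray:
    """Summarize a clip as its mean log-mel spectrum: tiny and edge-friendly."""
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return librosa.power_to_db(mel).mean(axis=1)

X, labels = [], []
for label in ("rain", "hail", "wind", "thunder", "snow"):
    for path in glob.glob(f"field_data/{label}/*.wav"):
        X.append(features(path))
        labels.append(label)

# A small forest keeps inference cheap enough for an edge device.
clf = RandomForestClassifier(n_estimators=50).fit(np.array(X), labels)
print(clf.predict([features("field_data/unlabeled/clip_001.wav")]))
```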

Here at Valence we have the necessary talent to collect the data you need, whether by crowdsourcing, simulating (we do AR/VR in-house and have talented 3D artists), using existing labeling tools, building custom labeling tools, or constructing physical devices to collect field data. We’re able to set up the necessary infrastructure to continuously re-train your model in the cloud and automatically deploy it to production, providing a closed-loop cycle of continuous improvement.
