Training Data Sets in Machine Learning Models


By Yuri Brigance

I have a particular interest in how training data sets for machine learning models are built.

This year’s TC Robotics & AI conference had all the proof we need that consumer robotics, powered by the latest Machine Learning science, is quickly becoming a booming industry with lots of investor interest behind it. New Machine Learning (ML) architectures and training techniques are coming out almost every month. It was interesting to see how these algorithms are being used to create a new wave of consumer tech, as well as large numbers of service offerings springing up to make machine learning more user-friendly. How we build training data sets for machine learning models is increasingly important.

Training Data Sets

Building training data sets is one of the most visible priorities in this new and growing ecosystem.

Machine Learning relies on A LOT of training data. Creating it is no easy feat. Much of it requires manual human effort to correctly label. A lot of companies have sprung up to help address this problem and make data collection and labeling faster and easier, in some cases automating it completely.

Aside from labeling, collecting such training data can be just as difficult. Self-driving cars are a well-known example: we’ve all heard of, and maybe even seen, autonomous vehicles being tested on public roads. However, it might come as a surprise that most of those driving miles aren’t used for training data collection.

As Sterling Anderson of Aurora and Raquel Urtasun of Uber explained, most self-driving technologies are actually trained in simulation. The autonomous fleets are out testing the trained models in the real world. On occasion the system will disengage and flag a new scenario. The disengagement condition is then permuted thousands of times and becomes part of the simulation, providing millions of virtual miles for training purposes. It’s cost efficient, scalable, and very effective.
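To make the idea of "permuting" a disengagement more concrete, here is a minimal sketch in Python. It is not how Aurora or Uber actually do it. The `Scenario` fields, the value ranges, and the `permute` helper are all hypothetical, chosen only to show how one flagged event can fan out into many simulated variants.

```python
import itertools
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    """Hypothetical description of one flagged disengagement."""
    ego_speed_mps: float        # speed of the autonomous vehicle
    pedestrian_offset_m: float  # lateral distance of a pedestrian from the lane
    rain_intensity: float       # 0.0 (dry) to 1.0 (downpour)
    time_of_day_h: int          # hour of day, 0-23


def permute(base, n_random=0, seed=0):
    """Return variants of `base`: a fixed grid plus optional random jitter."""
    speeds = [base.ego_speed_mps * f for f in (0.8, 1.0, 1.2)]
    offsets = [base.pedestrian_offset_m + d for d in (-0.5, 0.0, 0.5)]
    rains = [0.0, 0.5, 1.0]
    hours = [6, 12, 18, 23]

    # Systematic sweep: every combination of the values above.
    variants = [
        Scenario(s, o, r, h)
        for s, o, r, h in itertools.product(speeds, offsets, rains, hours)
    ]

    # Random jitter around the original event, reproducible via the seed.
    rng = random.Random(seed)
    for _ in range(n_random):
        variants.append(replace(
            base,
            ego_speed_mps=base.ego_speed_mps * rng.uniform(0.5, 1.5),
            rain_intensity=rng.random(),
            time_of_day_h=rng.randrange(24),
        ))
    return variants


base = Scenario(ego_speed_mps=12.0, pedestrian_offset_m=1.5,
                rain_intensity=0.2, time_of_day_h=17)
variants = permute(base, n_random=10)
print(len(variants))
```

Each variant would then be handed to the simulator as a separate run. A real system would vary far more than four parameters, which is how a single disengagement turns into thousands of training scenarios.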

Creating such simulations is not trivial. In order to provide the right fidelity, not only must the virtual world look visually hyper-realistic, but all the sensor data (lidar, radar, and a hundred others) must also be perfectly synced to the virtual environment. Think flight simulator, but with much better graphics. In many cases, sensor failures can be simulated as well, and self-driving systems need to be able to cope with the sudden loss of input data.

Visual data is notoriously difficult to label. Simulation aside, imagine if you are tasked with outlining all the cars, humans, cats, dogs, lamp posts, trees, road markings, and signs in a single image. And there are tens of thousands of images to go through. This is where companies like SuperAnnotate and ScaleAI come in.

SuperAnnotate provides a tool that combines superpixel-based segmentation with humans in the loop to allow for rapid creation of semantic segmentation masks. Imagine a drone orthomosaic taken over a forest with a variety of tree species — tools like this allow a human to quickly create outlines around the trees belonging to a specific category simply by clicking on them.
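The click-to-label idea is easy to sketch. The Python below is only an illustration, not SuperAnnotate's implementation. It assumes the image has already been split into superpixels by some separate step, so every pixel carries a superpixel ID. The `label_by_click` helper and the tiny 3x4 "image" are made up for the example.

```python
def label_by_click(superpixels, mask, click, class_id):
    """Assign class_id to every pixel in the superpixel under the click.

    superpixels: 2D list of ints, one superpixel ID per pixel (assumed to
                 come from a separate over-segmentation step).
    mask:        2D list of ints, the semantic mask being built (0 = unlabeled).
    click:       (row, col) of the annotator's click.
    """
    row, col = click
    target = superpixels[row][col]
    for r, line in enumerate(superpixels):
        for c, sp_id in enumerate(line):
            if sp_id == target:
                mask[r][c] = class_id
    return mask


# Toy 3x4 image with three superpixels (IDs 0, 1, 2).
superpixels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 1],
]
mask = [[0] * 4 for _ in range(3)]

# One click in the top-right corner labels the whole superpixel 1 as class 7.
label_by_click(superpixels, mask, click=(0, 3), class_id=7)
for line in mask:
    print(line)
```

A single click fills an entire region, which is why this style of tool is so much faster than tracing outlines by hand, as long as the superpixel boundaries line up with the real object edges.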

SuperAnnotate’s approach is interesting, but it likely won’t be sufficient for all scenarios. It’s useful for situations where you have well-defined, contrasting edges around the objects you are attempting to segment out, but it would likely not work so well for less defined separation lines. A good example is when you may want to figure out where the upper lip ends and the upper gum begins in portraits of smiling people. This will likely require a custom labeling tool, something we at Kopius have created on a number of occasions.

ScaleAI takes a different approach, and relies on a combination of statistical tools, machine learning checks, and most importantly, humans. This is a very interesting concept — effectively a Mechanical Turk for data labeling.

So it is quickly becoming apparent that data collection and training are whole separate pillars of the ML-powered industry. One might imagine a future where the new “manual labor” is labeling or collecting data. This is a fascinating field to watch, as it provides us with a glimpse of the kinds of new jobs available for folks who are now under threat of unemployment via automation. With one caveat — these systems are distributed, so even if you get a gig as a human data labeler, you may be competing with folks from all over the world, which has immediate income implications.

On the other hand, setting up simulations and figuring out ways to collect “difficult” data may be an entire engineering vertical on its own. If you are a video game, AR/VR, or general 3D artist or developer, you might find your skills very applicable in the AI/ML world. A friend of mine recently found an app that allows you to calculate your Mahjong score by taking a photo of your tiles. How would you train a model to recognize these tiles from a photo, in various lighting conditions and from all angles? You could painstakingly take photos of the tiles and try to label them yourself, or you could hire a 3D artist to model the tiles. Once you have realistic 3D models, you can spin up a number of EC2 instances running Blender (effectively a “render farm” in the cloud). Using Python, you can then programmatically script various scenes (angles, lights, etc.) and use Blender’s ray-tracing engine to crank out thousands of pre-labeled 3D renders of simulated tiles in all sorts of positions, angles, colors, etc.
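Here is a rough sketch of the scripting half of that pipeline. It deliberately does not call Blender at all. It only generates a randomized list of render jobs, each carrying its own label, that a separate Blender script could read and apply to the camera, lights, and tile model. The tile names, parameter ranges, and the `make_render_jobs` helper are all assumptions made for illustration.

```python
import json
import random

# Hypothetical subset of tile classes; a real set has many more.
TILE_NAMES = ["bamboo_1", "circle_5", "character_9", "red_dragon"]


def make_render_jobs(n, seed=0):
    """Build n randomized scene descriptions, each with its ground-truth label."""
    rng = random.Random(seed)  # seeded so the data set can be regenerated
    jobs = []
    for i in range(n):
        tile = rng.choice(TILE_NAMES)
        jobs.append({
            "output": f"render_{i:05d}.png",
            "label": tile,  # known up front, so no manual labeling is needed
            "camera_elevation_deg": rng.uniform(20, 90),
            "camera_azimuth_deg": rng.uniform(0, 360),
            "tile_rotation_deg": rng.uniform(0, 360),
            "light_strength": rng.uniform(0.3, 3.0),
            "light_color_temp_k": rng.choice([2700, 4000, 5600, 6500]),
        })
    return jobs


jobs = make_render_jobs(1000)
manifest = json.dumps(jobs[:2], indent=2)
print(manifest)
```

Because the script chooses which tile goes into each scene, the label comes for free with every render. Splitting the job list across several machines is then mostly a matter of handing each one a different slice of the manifest.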

But what if your task is to detect weather conditions (wind, rain, hail, thunder, snow) via a small IoT device with just a cheap microphone as a sensor? Where do you get all the training sounds to create your model? Scraping YouTube for sound can only get you so far; after all, those sounds are recorded with different microphones, background noises, and varying conditions. In this case, you may opt to create physical devices designed specifically for this kind of data collection. These may be expensive but might contain the required set of sensors to accurately record and label the sound you’re looking for, using the microphones you’ll use in production. Once the data is collected, you can train a model and run inference on a cheap edge device. Coming up with such data collection techniques can be an engineering field of its own, and execution requires manual labor to deploy these techniques in the field. It’s an interesting engineering problem, one that will undoubtedly give birth to a number of specialized service and consulting startups.

Here at Kopius we have the necessary talent to collect the data you need, either via crowdsourcing, simulating (we do AR/VR in-house and have talented 3D artists), using existing labeling tools, building custom labeling tools, or constructing physical devices to collect field data. We’re able to set up the necessary infrastructure to continuously re-train your model in the cloud and automatically deploy it to production, providing a closed-loop cycle of continuous improvement.

JumpStart Your Success Today

Kopius supports businesses seeking to govern and utilize AI and ML to build for the future. We’ve designed a program to JumpStart your customer, technology, and data success. 

Tailored to your needs, our user-centric approach, tech smarts, and collaboration with your stakeholders equip teams with the skills and mindset needed to:

  • Identify unmet customer, employee, or business needs
  • Align on priorities
  • Plan & define data strategy, quality, and governance for AI and ML
  • Rapidly prototype data & AI solutions
  • And, fast-forward success

Partner with Kopius and JumpStart your future success.

Artificial Intelligence: How Smart Is It?


How Smart is AI?

So are computers ready to take over the world and subjugate the human race, given our inferior intelligence and processing power? How smart is artificial intelligence?

That’s been a major Hollywood theme for decades. Who doesn’t remember the chilling lines in 2001: A Space Odyssey, “I’m sorry, Dave. I’m afraid I can’t do that,” when it suddenly becomes clear that the supercomputer HAL has gone rogue?

Or the ominous scene in Blade Runner, when escaped “replicant” Leon shoots the police officer administering a diagnostic test of his humanity (or, in this case, lack of humanity).

Although there are real concerns about setting AI free in the world, much of the media-hyped fear about the coming AI apocalypse is overblown. And even if there are valid technological and ethical considerations, the technology is still a long way off from that point.

Here’s how Andrew Ng, the chief scientist at Baidu from 2014 to 2017, put it in an interview with Vox earlier this year: “Worrying about evil-killer AI today is like worrying about overpopulation on the planet Mars. Perhaps it’ll be a problem someday, but we haven’t even landed on the planet yet.” (He does believe we should be thinking about how AI will displace the workforce of tomorrow, though.)

The Power of Artificial Intelligence and Machine Learning

The reality is that artificial intelligence and machine learning (let’s add some acronyms: AI and ML) are incredibly powerful technologies. They are able to find patterns in mind-boggling quantities of data orders of magnitude faster than humans. Plus they can learn to recognize objects and predict outcomes, and they get better at that over time. So while they are not likely to turn into evil killers of humanity, they are likely to transform, well, everything.

They will absolutely change the way we interact with the world — through Natural Language Processing (Hey, Siri, take me home.) and computer vision systems. Soon enough we will be able to initiate voice commands like “OK, Google, take me to the mountain in this photo.” Already Facebook can tag you practicing that embarrassing dance move at your best friend’s bachelor party. Or giving the keynote at an industry convention, for that matter.

AI Applications Across Industries

AI and ML will automate many of the boring or repetitive tasks that people perform now, which will transform the future workforce. Think of virtual assistants who schedule meetings and send automatic follow-up messages or appointment reminders.

We already have giant industrial robots that manufacture cars, and they are only getting smarter — like knowing how to avoid injuring people or even scheduling their own tune-ups so they don’t break down and cause expensive, disruptive work stoppages.

AI in the Automotive Industry

Autonomous vehicles rely on many of these AI systems strung together: a series of sensors — including video cameras, LIDAR, sonar, and motion sensors — detect the environment and feed that data to the car’s processing systems, which then analyze and act in real-time. The technology is marching forward at breakneck speed, with VC investments and high-profile acquisitions constantly making the news.

Although fully autonomous vehicles are many years off, multiple features that use AI tech are completely functional in cars on the road right now. These include adaptive cruise control, automatic emergency braking, lane departure warning, lane keeping assist, and front collision warning systems, to name a few.

AI Applications in Healthcare

Healthcare is also an area of incredible promise when it comes to AI and machine learning. IBM’s Watson Health mines health data to find patterns that no human mind would be powerful enough to recognize. This will help speed drug discovery, detect insurance fraud, and create personalized plans to keep people healthy, among other innovations.

Accessible Cloud-Based AI Innovations

It all sounded so sci-fi only a few decades ago. But now AI is upon us, and the speed of discovery is accelerating. That’s in part thanks to the availability of cloud-based AI services like Amazon’s Lex and Rekognition, which enable you to add voice and video recognition into your own systems. Or Microsoft’s AI services, which let you add analytics, speech recognition, and machine learning. Or the ubiquitous Google Translate, which can translate text, entire web pages, and even the writing on the outside of packaged goods in multiple languages all over the world. What used to be open only to Google, Facebook, and world superpowers is now accessible to everyone.


Chatbots: Much More Than A Novelty

The promise of Artificial Intelligence and chatbots is here.

Sure, humanoid robots aren’t yet roaming the earth, but AI-driven applications and AI-infused services are transforming the world around us into a more intelligent, interactive, and empowered domain. Looking for a good example? Ask Siri, Alexa, Cortana, or Cleverbot. They, collectively, are the answer.

Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana, and Rollo Carpenter’s Cleverbot are all examples of chatbots: “a computer program which conducts a conversation via auditory or textual methods.” Some chatbots use natural language processing to understand your speech and then respond verbally. Apple’s Siri is perhaps the most famous example of this type of chatbot, though Alexa and Cortana are also widely used. Other chatbots are text-based, responding to typed questions, commands, or observations. Microsoft’s Xiaoice, for example, was released in China in 2014 and, only a year later, had already been used by over 40 million smartphone owners (25% of whom had reportedly said “I love you” to their “virtual friend,” which is available on China’s two most prominent social media platforms, Weibo and WeChat).

Chatbots have been the subject of controversy (see Microsoft’s Tay) and frequent comic derision (see, e.g., Siri). More generally, many people see them as little more than a novelty, a fun way for consumers to interact with technology. But they are much, much more than that. Simply put, chatbots are a powerful example of the proliferation of Artificial Intelligence into mainstream society. And we are just scratching the surface of their capabilities.

To date, the landscape of chatbots available for consumers and enterprises has been dominated largely by the tech titans mentioned above. It is in the process, though, of getting significantly more diverse and dynamic, a phenomenon driven by the release of numerous chatbot frameworks for developers.

Chatbot frameworks are essentially software development kits (SDKs) for the AI-verse. They provide a platform — the technology infrastructure — for developers to build chatbots in a manner which meets their needs. The release of frameworks like Microsoft’s Bot Framework and Facebook’s Bot Engine (wit.ai) means that any developer, be they a hobbyist or professional service provider, can build a chatbot to improve their life or the lives of those around them.

Want to build a chatbot that speaks to you in Captain Hook lingo in time for the annual Talk Like a Pirate Day (September 19)? Have at it! Think your business can benefit from a chatbot designed to provide a more intuitive way to access and organize the data that fuels your success? Build it!

…or let us build it! Valence understands that chatbots are more than a novelty; they are a paradigm-shifting technology that can digitally transform businesses in any sector. That’s why we’re putting them to work for our clients in ways that support both their strategic objectives and their day-to-day tactics. And that’s why we’re looking forward to learning how we can put them to work for you.
