Is Your Data Ready for Machine Learning?

Dan Tynan, Radius Guest Contributor

Artificial intelligence is changing the world more rapidly than anyone could have predicted a few years ago. The explosion in available data, coupled with low-cost computing power and dramatic advances in AI capabilities, will enable organisations to optimise their operations, personalise their products, and anticipate future demand.

Yet, a recent survey shows that most companies still aren’t fully prepared to take advantage of this technology.

In partnership with Dell Technologies and Intel, Forbes Insights surveyed more than 700 top executives about their plans for AI and machine learning. While three out of four CxOs say AI is a core component of their digital transformation plans, fewer than 25 percent have implemented it anywhere within their organisation.

Just 11 percent have executed an enterprise-wide data strategy, and a scant 2 percent say they have a solid data governance process in place. In fact, while most organisations have more data than they know what to do with, much of it remains siloed, unstructured or otherwise ill-prepared for use in machine learning models.

Without the right data, AI initiatives will fail.

How Much Data Is Enough?

Organisations should start their AI journey by figuring out the questions they want to answer and the predictive capabilities they’d like to develop, says Josh Simons, senior director and chief technologist for high performance computing at VMware. That will determine the data they should collect.

But the amount and types of data organisations will need also depend on whether they’re using supervised or unsupervised machine learning models.

Supervised learning trains the model to look for specific results. It’s what allows Amazon Alexa to understand what you’re saying or your iPhone to unlock when it sees your face. Supervised learning requires a significant volume of labelled data, but it can allow you to build powerful predictive models.

Unsupervised learning involves analysing pools of raw data to detect patterns and identify anomalies, such as combing through computer security logs to flag potential cyberattacks. The amount of data you need depends on what you want the model to do.
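
To make the log example concrete, here is a minimal sketch of flagging anomalies in unlabelled data. It is a simple statistical stand-in rather than a trained model: it assumes the log has already been reduced to hourly counts of failed logins, and the numbers, the function name and the threshold of two standard deviations are all hypothetical.

```python
from statistics import mean, pstdev


def flag_anomalies(counts, threshold=2.0):
    """Return the positions whose value lies more than `threshold`
    standard deviations away from the mean of `counts`."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


# Hypothetical failed-login counts per hour, already extracted from a log.
failed_logins = [3, 5, 4, 6, 4, 5, 3, 97, 4, 5]
print(flag_anomalies(failed_logins))  # [7]
```

No labels are involved: the spike at position 7 is flagged only because it looks unlike the rest of the data.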

Some enterprises will start with an unsupervised model to identify patterns in data, and then use those to structure the data for use with a supervised one, says Cambron Carter, director of engineering for GumGum, a computer vision company that builds AI solutions for the advertising, medical, and sports industries.

“It’s as if I dumped a bag of marbles onto a table and told you to sort them, without telling you anything else,” Carter says. “You could sort them by size, colour, design or whatever. But you’re going to impose some kind of structure on those marbles.”

If you’re training a robotic arm to identify parts passing by on an assembly line, you can start with a set of a few thousand labelled images, or even fewer depending on the task, says Carter. If you’re dealing with more complex tasks—like diagnosing cavities on dental x-rays or identifying logos on Formula One race cars as they zoom past—you’ll need to start with a significantly larger set of labelled data.

But volume alone isn’t enough; the data also needs to represent what you’d encounter in real-world scenarios, he adds.

“Let’s say I’m trying to train a model to recognise ten different animals and I’ve collected a million images to do so,” he says. “If 900,000 of them are tigers, the system is going to learn that if it predicts ‘tiger,’ it will be right 90 percent of the time.” The model may appear to be highly accurate, but it will offer poor predictive value in the real world.
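
The arithmetic behind Carter's example can be checked with a short sketch. It scales the data set down to 1,000 labels, assumes the non-tiger images are spread across the other nine classes, and uses a "model" that always answers tiger. The animal names other than tiger and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical class names; only "tiger" comes from the example above.
ANIMALS = ["tiger", "lion", "zebra", "giraffe", "elephant",
           "panda", "wolf", "bear", "fox", "otter"]


def build_labels(n_tigers=900, n_others=100):
    """Build a label list dominated by tigers."""
    others = ANIMALS[1:]
    rest = [others[i % len(others)] for i in range(n_others)]
    return ["tiger"] * n_tigers + rest


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


labels = build_labels()
always_tiger = ["tiger"] * len(labels)
print(accuracy(labels, always_tiger))           # 0.9
print(balanced_accuracy(labels, always_tiger))  # 0.1
```

Plain accuracy rewards the lazy guess, while a per-class measure shows the model gets nine of the ten animals entirely wrong.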

You’ll also need three distinct pools of data: one to train your model, another to validate that it’s accurate and a third set of data to test it before putting it into production, says Yiwen Huang, CEO of, an automated machine learning platform.
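
One way to carve out those three pools is sketched below. The 70/15/15 proportions, the fixed random seed and the function name are assumptions made for illustration, not figures from Huang.

```python
import random


def three_way_split(records, train=0.7, validation=0.15, seed=42):
    """Shuffle a copy of `records` and cut it into three disjoint pools:
    training, validation and test (the test pool gets whatever is left)."""
    if not 0 < train + validation < 1:
        raise ValueError("train + validation must leave room for a test pool")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


train_set, validation_set, test_set = three_way_split(range(1000))
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before cutting matters when the records arrive in some order, such as by date or by customer, because otherwise each pool would see a different slice of reality.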

Getting to the Goldilocks data set is tricky. Start with too little data, and you risk overfitting—creating a model that works well with your training set but poorly when it encounters new data. Use an insufficiently diverse mix of data, and you could be biasing the results in a particular direction.
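
A common way to spot overfitting is to compare accuracy on the training pool with accuracy on data the model has never seen. The sketch below uses a deliberately naive, hypothetical "model" that simply memorises its four training examples; the toy rule that values of 50 or more count as large is also an assumption.

```python
from collections import Counter


def true_label(x):
    """Toy ground truth (an assumption): values of 50 or more are 'large'."""
    return "large" if x >= 50 else "small"


class Memoriser:
    """Naive model: recalls training examples, guesses the majority otherwise."""

    def fit(self, examples):
        self.table = dict(examples)
        self.default = Counter(self.table.values()).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, examples):
    """Share of examples the model labels correctly."""
    return sum(model.predict(x) == y for x, y in examples) / len(examples)


train = [(x, true_label(x)) for x in (10, 20, 30, 60)]    # too little data
holdout = [(x, true_label(x)) for x in range(1, 100, 2)]  # unseen values

model = Memoriser().fit(train)
print(accuracy(model, train))    # 1.0
print(accuracy(model, holdout))  # 0.5
```

A perfect score on the training set alongside coin-flip performance on new data is the signature of a model that has memorised rather than learned.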

“The unfortunate answer is there’s a lot of trial and error,” says Carter. “It depends on the scope, what you’re trying to learn, and how many variables you’re dealing with. There’s an artisanal component to it as well as an experiential one.”


Is Your Data Clean?

Once you have the data locked in, you need to prep it: removing duplicates, ensuring fields are formatted consistently and so on. It’s not a task you want to leave to highly paid, hard-to-find analytics experts, yet that’s something many companies do, says Simons.

“The data science community complains a lot about how much of their time is spent collecting data and getting it into a format they can actually feed into these algorithms,” he says. “It can be a hugely complicated task, and if you don’t do it right, it’s garbage-in garbage-out.”
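
As a rough illustration of that prep work, the sketch below formats fields consistently and then removes duplicates. The record layout (a name and an email address), the choice of email as the duplicate key and the sample rows are all hypothetical.

```python
def normalise(record):
    """Return a copy of the record with consistently formatted fields."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


def clean(records):
    """Normalise every record, then drop duplicates (keyed on email)."""
    seen = set()
    cleaned = []
    for record in map(normalise, records):
        if record["email"] not in seen:
            seen.add(record["email"])
            cleaned.append(record)
    return cleaned


raw = [
    {"name": "  ada   lovelace", "email": "Ada@Example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "alan turing", "email": "alan@example.com"},
]
for row in clean(raw):
    print(row)
```

The order of the two steps matters: the first two rows only reveal themselves as duplicates once their formatting has been made consistent.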


If you’ve got a large pool of unstructured data, such as collections of random images, you’ll need to assign labels that help the machine learning model understand what it’s looking at. That may require bringing in subject matter experts to fuel the learning process by manually labelling images. For example, doctors might identify which x-rays indicate the presence or absence of tumours.

The process can be challenging, expensive and time-consuming, Simons adds. Businesses will need to decide whether it makes more sense to build their own data sets or acquire pre-labelled ones.

In some cases, organisations may want to mix both structured and unstructured data to identify common threads, such as records from a CRM database with text comments from user forums.

“There’s a lot of potentially valuable unmined data in those forums,” he says. “You can apply sentiment analysis and use it as additional signal alongside your more structured business system data.”

How Do You Get Started?

Building out a machine learning platform can be time- and resource-intensive. Data scientists are in high demand, and the return on investment is far from guaranteed. But there are a few ways organisations can ease into it.

“The best thing to do is start small,” says Huang. “Identify a use case that has value, but is also practical and where data is readily available. The best way to understand whether you have the right data set is to try it out. But it can be really expensive to do that.”

Huang says companies can upload their data sets to the platform, which builds the optimal model for them using one of 20 commonly used learning algorithms. In roughly an hour, he says, they’ll be able to find out if their data will yield a high-quality model.

There are a growing number of AI startups that can take your data and build a rudimentary model for you, adds Simons. Organisations can also get a head start by taking algorithms that are already trained on data similar to theirs, and customising them—an approach known as transfer learning.

Then, there are the huge AI-as-a-service platforms such as Amazon Rekognition, Clarifai and Google Cloud Vision API. But those tend to be limited in functionality and expensive at scale.

“It will soon be table stakes for every company to have a chat bot or some kind of image recognition,” says Simons. “What will really provide business value is differentiated machine learning, where organisations are applying these techniques to their own data and solving their own problems.”

And if you haven’t already started down this road, you’re behind the curve.
