Data Preparation: the Foundation for Success in Supervised ML

POSTED 09/07/2022 | By: Kristin Lewotsky, A3 Contributing Editor

When it comes to machine learning (ML) projects, it’s easy to get caught up in algorithms and model building. After all, the whole point of an ML project is to develop a model that can operate on new data to answer questions, uncover insights, increase safety, and reduce human error. A model is only as good as the datasets it uses, however. In fact, data preparation is arguably the most important step in the process. This particularly holds for supervised ML, which depends on labeled datasets to train a model to make decisions autonomously (or semi-autonomously).

For an example of the importance of the training dataset, look no further than Amazon. In 2014, the company launched an AI initiative to prescreen resumes. The development team used a training set based on 10 years of resumes. Initially, the model seemed to do well, until the developers realized that for technical roles, the model was biased against female applicants. The problem was not a sexist ML tool. It was that the dataset of optimal resumes consisted primarily of submissions from men; in effect, the dataset taught the model that the best applicants were male and should be favored over women. Subsequently, according to Reuters, the project was disbanded.[1]

Data preparation is crucial to the success of an ML initiative. It’s also the step that requires the most input from the end user. Companies that want to apply supervised ML don’t need to have data scientists in house. They can hire outside consultants or work with vendor partners. There’s no substitute for domain expertise, though. The end user’s engineering teams, shop floor staff, and IT department provide the context that connects data to physical phenomena. Their inside knowledge of operations, equipment, and products brings a perspective that is critical to the extraction and preprocessing steps.

Here, we review key steps in developing an effective training dataset for supervised ML. By following best practices from data collection through feature generation, industrial companies can lay the groundwork for ML projects that can help improve product quality, reduce downtime, and streamline operations.

Define the Problem

Given the importance of data, a project should kick off with gathering as much as possible, right? Wrong. Successful ML projects don’t start with data; they start with identifying the business need that the project is going to serve. Industrial companies are increasingly using supervised ML for applications like automated visual inspection (AVI), predictive maintenance, and supply-chain management. Objectives should not just be identified but also quantified, where possible. Is the goal to increase inspection throughput by 10%? Cut downtime by 20%? Boost performance consistency across facilities? Determine the goal and expectations at the start of the project.

It’s also important to confirm that the business question or problem is a good fit for ML. “You want to understand the possible business returns and whether the problem itself is a good one to phrase as a machine learning project,” says Ivan Zhou, senior machine learning engineer at Landing AI (Palo Alto, CA). The key, he says, is to ensure that the figure of merit identified in the project aligns with the business goal. “Sometimes users will find an academic paper that seems relevant to what they’re doing so they will decide to use the same evaluation metrics, like accuracy. But let’s say their business need is really to detect possible defects at high speed. Now, they are spending their efforts optimizing for the wrong target. That's why we always encourage customers to think through and choose their metrics carefully.”
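To see why the choice of metric matters, consider a minimal sketch with made-up inspection results. The counts below are illustrative assumptions, not data from any project described here. A model that never flags a defect still scores well on accuracy while missing every defective part.

```python
# Hypothetical inspection results: 1 = defective, 0 = acceptable.
# 95 good parts and 5 defective parts (illustrative numbers only).
actual = [0] * 95 + [1] * 5

# A trivial "model" that always predicts "acceptable".
predicted = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of all parts that were classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of truly defective parts that the model caught."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy: {accuracy(actual, predicted):.2f}")  # accuracy: 0.95
print(f"recall:   {recall(actual, predicted):.2f}")    # recall:   0.00
```

If the business need is to catch defects, recall (or a similar defect-focused measure) is closer to the goal than raw accuracy.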

Collect the Data

At this point, we need to start thinking about data. What type of data is required to address the business question and is it available? The modern industrial environment is awash in data, from vision systems for product inspection and process control to embedded sensors in smart components like drives and PLCs. Increasingly, devices include data loggers, and shop floors may have data historians or even edge computing capabilities. Other sources of aggregated data include SCADA systems, as well as ERP and MES applications.

For condition monitoring use cases, additional sensors may be necessary, but that isn’t necessarily an issue. “It has become easier to do this, especially with IoT-enabled devices,” says Uziel Salgado, cofounder of IndustLabs (Richardson, TX). “You can pretty much put a sensor on anything, with a battery backup and a connection.”  

“In my experience, these plants are generally full of sensors,” says Scott Genzer, data scientist at RapidMiner (Boston, MA). “Getting the data is not the problem.  They’re usually drowning in it. The question is how to get the meaningful data.” 

The amount of data needed for a training dataset depends on the scope and complexity of the problem. For a simple project, a few hundred examples can be sufficient to produce very accurate results. Conversely, the Google quick response email project used a dataset of 238 million entries. A general rule of thumb is that the amount of data required is roughly an order of magnitude greater than the number of features being used for the model.[2]
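As a quick illustration of that rule of thumb, the sketch below multiplies the feature count by ten. The function name and the default factor are assumptions made for this example; the result is a rough starting point, not a guarantee of sufficient data.

```python
def rough_min_examples(num_features, factor=10):
    """Rule-of-thumb estimate: about an order of magnitude more examples than features."""
    return num_features * factor


for n in (30, 300, 3000):
    print(f"{n} features -> roughly {rough_min_examples(n)} examples or more")
```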

One of the challenges of data collection in the industrial sector is often the limited number of examples demonstrating the issue of concern. Industrial equipment is built to last, so organizations developing predictive maintenance models may have to resort to digitizing historic records, adapting failure patterns from similar equipment, or running other types of models (see Getting Started with AI-Based Predictive Maintenance). 

Similarly, manufacturers also put great effort into maximizing yield, so the issue of collecting sufficient examples to form a representative dataset also needs to be taken into consideration for AVI applications.

Data Philosophy

To build a dataset for supervised machine learning, data needs to go through the following steps:

  • Extract – collected from various sources
  • Transform – converted into a consumable format
  • Load – moved into storage for easy access by users

The order and particulars of these operations vary depending on the data sources and the use cases. For structured data from consistent sources, the traditional approach has been to extract, transform, and load the data (ETL) into a relational database like a data warehouse. This works well for certain types of well-established and predictable business analytics. On the downside, it can’t be used with unstructured data like image files, text files, and video. An even bigger concern for ML applications is the fact that ETL constrains the size, contents, and form of the dataset from the very beginning.

In modern supervised ML, there’s no way to know in the beginning how the data will be used in the future. As a result, the trend has been toward extract, load, and transform (ELT) using a data lake. Data lakes are repositories of raw data, both structured and unstructured. The idea is that users can extract subsets and transform them as required for each particular application, leaving the raw data in the data lake for future use.

The ELT/data lake approach offers great flexibility in terms of data types and the ability to serve the organization as needs evolve. On the downside, storing raw data increases the storage volume. This can get expensive, particularly if data is being uploaded to cloud storage. Some simple edge processing can reduce data volume prior to loading.
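One possible form of simple edge processing is to average high-rate sensor readings over fixed windows before loading them. The sketch below assumes that approach; the window size and the sample values are hypothetical.

```python
def block_average(samples, window):
    """Average consecutive, non-overlapping windows of readings."""
    if window <= 0:
        raise ValueError("window must be positive")
    averaged = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Eight hypothetical temperature readings reduced to two values.
raw = [20.0, 20.2, 20.1, 20.3, 25.0, 25.2, 25.1, 25.3]
reduced = block_average(raw, window=4)
print(len(raw), "readings ->", len(reduced), "readings")
```

Averaging discards short transients, so it trades away some of the flexibility that keeping raw data provides. Whether that trade is acceptable depends on the use case.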

Data Exploration

Before significant time and effort is put into data preparation, it’s useful to perform some exploratory analysis via histograms, box-and-whisker plots, scatterplots, etc. This gives a first look at the type and quality of data available. What’s the condition of the overall raw data? What will be required to make it usable? What types of insights can it provide? After this preliminary evaluation, we dive into data preparation, sometimes returning to exploratory data analysis to inform later steps.
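A first look at the data does not need special tooling. The sketch below computes the five numbers behind a box-and-whisker plot and the bin counts behind a histogram, using only the Python standard library (`statistics.quantiles` requires Python 3.8 or later). The vibration readings are invented for illustration.

```python
import statistics


def five_number_summary(values):
    """Minimum, quartiles, and maximum: the basis of a box-and-whisker plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def histogram_counts(values, bins=4):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values match
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical vibration readings with one suspicious spike.
vibration = [2.1, 2.3, 2.2, 2.4, 2.2, 2.3, 9.8, 2.1, 2.5, 2.2]

summary = five_number_summary(vibration)
counts = histogram_counts(vibration)
print(summary)
print(counts)
```

A maximum far above the third quartile, or a histogram with one isolated bin, flags readings that deserve a closer look with the domain experts.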

Figure 1: Training datasets with inconsistently labeled data, like this one, can severely impact model accuracy. Here, the connection between scratch length and defect classification, as determined by different labelers, is seemingly random. (Courtesy of Landing AI)

Data Labeling

Accurate data labeling is one of the most critical aspects of developing a training dataset for supervised ML. Garbage in, garbage out is a truism from the early days of computers. Because the model learns from the training data, results will only be as good as the labeling.

In the case of quantitative data such as that used for predictive maintenance, labeling can be straightforward: what is the percent change in vibration that can flag a cracked impeller blade on a pump? What is the change in current draw that can indicate a bearing defect? The key is that the engineering subject matter experts work with the data scientists to map the data to the physical phenomena of interest.

Maintaining consistent definitions of what is acceptable and what is a flaw can be particularly challenging when labeling for machine-vision applications such as AVI. How big does a chip or scratch have to be before the product is considered defective? The training dataset needs to present clear examples. Here, inconsistency can severely impact both product quality and model training (see Figure 1).

Figure 2: In this dataset, examples have been sorted by scratch length, with the length of the scratch determining whether the item is acceptable or defective. (Courtesy of Landing AI)

Even expert labelers may disagree on what constitutes a defect, but it’s essential to work toward consensus to create an effective training dataset. Quantifying features where possible, such as defining the length of a scratch that constitutes a defect, can be very helpful (see Figure 2). Another good solution to this problem is the establishment of a defect book, a collection of examples that clearly demonstrates what is considered acceptable and what is not.
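Once a defect has been quantified, a simple rule can flag labels that disagree with the agreed definition. The sketch below assumes a hypothetical 2.0 mm scratch-length threshold and invented image records; in practice, the threshold would come from the experts' consensus.

```python
# Hypothetical threshold agreed on by the labeling team.
DEFECT_THRESHOLD_MM = 2.0


def label_by_length(scratch_mm, threshold=DEFECT_THRESHOLD_MM):
    """Apply the agreed rule: scratches at or above the threshold are defects."""
    return "defect" if scratch_mm >= threshold else "ok"


# Invented examples: measured scratch length plus the label a person assigned.
samples = [
    {"id": "img_001", "scratch_mm": 0.5, "human_label": "ok"},
    {"id": "img_002", "scratch_mm": 3.1, "human_label": "defect"},
    {"id": "img_003", "scratch_mm": 2.4, "human_label": "ok"},
    {"id": "img_004", "scratch_mm": 1.2, "human_label": "defect"},
]

# Collect the images whose human label conflicts with the rule.
to_review = [
    s["id"] for s in samples
    if label_by_length(s["scratch_mm"]) != s["human_label"]
]
print(to_review)  # ['img_003', 'img_004']
```

The flagged images are candidates for relabeling or for inclusion in the defect book as borderline cases.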

Confirm Data Quality

As the Amazon example demonstrated, having a high-quality dataset is fundamental to the success of the supervised ML application. In fact, it’s better to have a smaller set of high-quality data than to have a large set of mixed quality data. Consider a set of training images for an AVI application. Images that are out of focus, incomplete, or low contrast can confuse the model and should be removed (see Figure 3).

Figure 3: For best results, an AVI training dataset should be culled down to only high-quality images. (Courtesy of Landing AI)

“We believe that data capture is a critical piece in this whole application,” says Zhou. “So, we will consider the quality of data before we train a single model.” Indeed, improving data quality at the point of collection can significantly enhance overall model performance. “From what we’ve observed, you can spend, say, two months to improve your model to get your target level, but sometimes it will only take you maybe two weeks to improve your imaging system to reach a similar level of performance,” he adds.

The importance of data quality also holds for the type of numerical data captured for predictive maintenance applications. Check for entries with missing values and for errors introduced during manual entry. Review the dataset for duplicate entries and corrupted data. These types of issues can be addressed during data cleansing, but it’s important to evaluate them upfront to be sure that they do not dominate the dataset.
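An upfront audit of this kind can be as simple as counting the problem rows. The sketch below is one way to do it; the field names and readings are hypothetical.

```python
def audit(records, required_fields):
    """Count rows with missing values and rows that exactly repeat an earlier row."""
    rows_missing = sum(
        1 for r in records if any(r.get(f) is None for f in required_fields)
    )
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {
        "rows": len(records),
        "rows_missing_values": rows_missing,
        "duplicate_rows": duplicates,
    }


# Hypothetical motor readings (field names are illustrative).
readings = [
    {"ts": "2022-09-07T08:00", "current_a": 4.1, "temp_c": 55.0},
    {"ts": "2022-09-07T08:01", "current_a": None, "temp_c": 55.2},
    {"ts": "2022-09-07T08:02", "current_a": 4.2, "temp_c": 55.1},
    {"ts": "2022-09-07T08:02", "current_a": 4.2, "temp_c": 55.1},  # duplicate
    {"ts": "2022-09-07T08:03", "current_a": 4.3, "temp_c": 55.4},
]

report = audit(readings, ["ts", "current_a", "temp_c"])
print(report)
```

Comparing the problem counts with the total row count shows whether cleansing is a minor cleanup or a sign that the data source itself needs attention.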

Perhaps most important, confirm that the data you have is sufficient to address the business need. An initiative to cut downtime by preventing failure of a motor at one pinch point may not be supported by capturing data from elsewhere on the machine.

Convert to a Consumable, Consistent Format

The data needs to be converted to a format that can be used by the model. This can be an issue when input comes from a variety of sources. It's not just a question of file format but of how the individual records in the dataset are expressed. What is the format for numerical representation in each column? Is there a set range for a given value that needs to be met? Are the labels consistent and do they need to be revised to make them more usable by the model?

Format inconsistency can be a particular challenge when data is sourced from a variety of sensors and devices. Output may be formatted and delivered by a data logger, or it may simply be a string of numbers that needs to be converted into individual entries. It’s worth checking to see whether output formats can be harmonized at the data collection point to simplify this part of the process.
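As a minimal sketch of harmonizing two sources, suppose one device delivers structured records with ISO timestamps and Celsius temperatures, while another emits a bare comma-separated string with a US-style date and Fahrenheit. Both source formats and all field names here are assumptions made for illustration.

```python
from datetime import datetime


def from_logger(record):
    """Normalize a structured data-logger record (ISO timestamp, Celsius)."""
    return {
        "timestamp": datetime.fromisoformat(record["ts"]),
        "temp_c": float(record["temp_c"]),
    }


def from_raw_string(line):
    """Normalize a raw 'MM/DD/YYYY HH:MM:SS,temp_f' string to the same shape."""
    stamp, temp_f = line.split(",")
    return {
        "timestamp": datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S"),
        "temp_c": round((float(temp_f) - 32) * 5 / 9, 2),
    }


rows = [
    from_logger({"ts": "2022-09-07T14:29:00", "temp_c": "19.8"}),
    from_raw_string("09/07/2022 14:30:00,68.0"),
]
for row in rows:
    print(row["timestamp"].isoformat(), row["temp_c"])
```

After conversion, every record shares one timestamp type and one unit, so downstream steps do not need to know which device produced it.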

Data Cleansing

Data cleansing is a key step in addressing some of the quality issues discussed above. Duplicate entries, for example, can skew model performance, so they need to be identified and removed. Empty columns or columns with only one or two entries should be deleted. If an entry is missing one of its values, it may be possible to address the issue with data imputation, for example by filling in the gap with an average value.

Data imputation should be used with caution, particularly in predictive maintenance applications, to avoid introducing misleading information that can cause inaccurate results. Similarly, a common step in data cleansing is to remove outliers, again, to prevent skew. For certain types of applications such as retail or hospitality, that’s a sound approach. In a predictive maintenance application, outliers are frequently essential to help the model to recognize anomalous behavior. These examples underscore the importance of domain expertise to the success of the supervised ML project.
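The cleansing steps described above can be sketched in a few small functions: remove duplicates, drop columns with only one or two entries, and fill a gap with the column mean. The column names, the sample rows, and the cutoff of three entries are assumptions for this example. Outliers are deliberately left alone.

```python
from statistics import mean


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_sparse_columns(rows, min_count=3):
    """Remove columns that have fewer than min_count non-missing entries."""
    columns = {name for row in rows for name in row}
    keep = {
        name for name in columns
        if sum(row.get(name) is not None for row in rows) >= min_count
    }
    return [{k: v for k, v in row.items() if k in keep} for row in rows]


def impute_mean(rows, column):
    """Fill missing values in one column with the mean of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical readings: one duplicate row, one sparse column, one gap.
rows = [
    {"temp": 70.0, "vib": 0.2, "note": None},
    {"temp": 70.0, "vib": 0.2, "note": None},   # duplicate
    {"temp": 72.0, "vib": None, "note": None},  # missing vibration value
    {"temp": 74.0, "vib": 0.4, "note": "check"},
    {"temp": 71.0, "vib": 0.3, "note": None},
]

cleaned = impute_mean(drop_sparse_columns(drop_duplicates(rows)), "vib")
for row in cleaned:
    print(row)
```

Whether mean imputation is appropriate for a given column is a judgment call for the domain experts, for the reasons given above.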

Feature Engineering

Feature engineering refers to a set of techniques applied to data to create an optimal training dataset for a given problem. Feature engineering can be subdivided into several classes:

Feature Selection

Preliminary datasets typically contain a high number of data features, which can slow down processing and deliver poor results. Feature selection is the process of reducing the dimensions of the dataset. “We start with big data,” says Genzer. “There are very good ways to ‘chainsaw’ it down, maybe from 3,000 columns to 300. Then, we can get serious. Generally, the other 2,700 columns are just noise.”

Simple examples are to remove irrelevant data types or data that is essentially redundant (e.g. the age of an asset and its date of installation). The focus should be on features that will assist with achieving the business goal, but it’s a process that needs to be approached with care. “You don’t want to just pick a few features because then you are biasing the algorithm based on what you think will be predictive, rather than letting the algorithm find what it thinks is predictive and then doing a sanity check afterward,” says Genzer. “This is particularly relevant in manufacturing, where your features are generally sensor readings and you usually don’t know where the ‘smoking gun’ lies.”

A variety of software tools exist to assist with feature selection. Data exploration helps guide the way, but ultimately the process requires domain expertise from the user’s subject-matter experts, working closely with the data scientists. The engineering team is best equipped to identify characteristics most likely to indicate a developing issue. Still, they need to guard against the temptation to completely control feature selection. The goal is to leave in reasonable data, then let the ML model do its work.
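A conservative first pass at feature selection removes only columns that carry no information or that duplicate another column. The sketch below is one such pass under assumed column names and an assumed correlation cutoff of 0.95. It drops constant columns and one column of each highly correlated pair, and leaves everything else for the model to sort out.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(columns, max_corr=0.95):
    """Drop constant columns and columns nearly collinear with one already kept."""
    kept = {}
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant column: nothing to learn from it
        if any(abs(pearson(values, other)) > max_corr for other in kept.values()):
            continue  # redundant with a column that was already kept
        kept[name] = values
    return list(kept)


# Hypothetical columns: asset age and installation year say the same thing.
columns = {
    "asset_age_years": [1, 2, 3, 4, 5],
    "install_year": [2021, 2020, 2019, 2018, 2017],
    "line_id": [7, 7, 7, 7, 7],
    "vibration_rms": [0.2, 0.5, 0.1, 0.9, 0.3],
}

print(select_features(columns))  # ['asset_age_years', 'vibration_rms']
```

Which column of a redundant pair survives depends on column order here, so the result should still get a sanity check from the subject-matter experts.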

Feature Transformation

Feature transformation refers to techniques used to improve the representation of the dataset. At its most basic, it can simply involve renaming or reformatting data—changing the number of significant digits, altering the format for dates, etc. It can involve scaling or normalization to prevent extreme values from dominating the representation. Similarly, skewed data can be tamed via a log transformation.
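Two of the transformations mentioned above, scaling and a log transformation, can be sketched briefly. Min-max scaling is used here as one common form of normalization; the sample values are hypothetical.

```python
import math


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Compress a skewed, strictly positive range with a base-10 logarithm."""
    return [math.log10(v) for v in values]


# Hypothetical skewed feature spanning several orders of magnitude.
cycle_counts = [1, 10, 100, 1000, 10000]

scaled = min_max_scale(cycle_counts)
logged = log_transform(cycle_counts)
print(scaled)
print(logged)
```

After min-max scaling alone, the four smallest values are crowded near zero. The log transformation spreads them evenly, which is why it is often applied to skewed data before scaling.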

Feature Generation

In some cases, existing features can be processed to create more useful information. Converting time-series vibration data into frequency space, for example, makes it possible to detect the emergence of higher-order harmonics that can indicate a defect. Correlating the temperature reading of a rooftop motor with the day’s high temperature could help the model recognize that the increase doesn’t represent a problem, particularly when compared to the asset’s history. Done properly, feature generation results in a final dataset that is much more than the sum of its parts.
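To illustrate the conversion into frequency space, the sketch below applies a plain discrete Fourier transform to a synthetic vibration signal. The signal, its 8 Hz fundamental, its weaker 16 Hz harmonic, and the 64 Hz sample rate are all invented for this example. A production system would normally use an optimized FFT library rather than this direct implementation.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each frequency bin from 0 up to the Nyquist frequency."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(signal)
        )
        magnitudes.append(abs(total) / n)
    return magnitudes


SAMPLE_RATE_HZ = 64  # hypothetical sensor sample rate
N = 64               # one second of samples

# Synthetic vibration: 8 Hz fundamental plus a weaker 16 Hz harmonic.
signal = [
    math.sin(2 * math.pi * 8 * i / SAMPLE_RATE_HZ)
    + 0.3 * math.sin(2 * math.pi * 16 * i / SAMPLE_RATE_HZ)
    for i in range(N)
]

mags = dft_magnitudes(signal)

# Take the two strongest bins and convert bin index to frequency in Hz.
peaks = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peak_freqs = sorted(k * SAMPLE_RATE_HZ / N for k in peaks)
print(peak_freqs)  # [8.0, 16.0]
```

The magnitude at the harmonic frequency can then serve as a generated feature, so that growth in that value over time becomes visible to the model.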


Data preparation provides the foundation to any supervised ML project. This has been a brief summary—the actual process can be resource intensive. “What we hear from our customers is that 80% of their time [on an ML project] is spent on consolidating data and exploratory data analysis and feature engineering,” says Kosti Vasilakakis, head of product growth, low code/no code ML, AWS. That said, the time invested in building a quality dataset will pay off in terms of speed of model deployment and quality of results.

Markets like retail, financial services, and healthcare make headlines for their use of AI. The industrial sector needs to follow. “With the speed of technology development, even if a process or production line is working today, that doesn't necessarily mean that it's going to work for the future,” says Salgado. “And especially with globalization, you can guarantee that your competitors are working on a solution or a product that's better, more innovative and probably more cost effective. AI is a tool at their disposal that they have to consider implementing.”


  1. Amazon scraps secret AI recruiting tool that showed bias against women, Reuters
  2. The Size and Quality of a Dataset, Google