Recommendations

Putting data quality improvements into practice

Recommendation

Inspect and prepare dataset thoroughly

“It’s a common joke that 80 percent of machine learning is actually data cleaning, as though that were a lesser task. My view is that if 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.”¹⁷

Machine learning datasets tend to be complex and high dimensional. Knowing what to clean and what features to engineer requires teams to explore datasets with a variety of summary statistics and visualizations to identify erroneous data that may seem normal outside the context of other data.
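As a minimal sketch of this kind of exploration, the snippet below summarizes one numeric column and flags values that fall outside the interquartile-range fences. The sample values, the helper name `iqr_outliers`, and the 1.5 multiplier are illustrative assumptions, not taken from the source. A univariate check like this catches only gross errors; values that look normal in isolation still need to be compared against other fields.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up ages for illustration; 240 is a plausible data-entry error.
ages = [34, 29, 41, 38, 36, 33, 40, 37, 35, 240]

# Basic summary statistics; note how the single bad value drags the mean
# far away from the median.
summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
}
outliers = iqr_outliers(ages)

print(summary)
print("flagged:", outliers)
```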

Preparing a dataset reduces noise in the data directly, and feature engineering can improve predictive power while reducing complexity. Together, these reduce the amount of data a model needs to learn the underlying pattern accurately.

Figure 4: Anscombe's Quartet illustrates the importance of thoroughly exploring and understanding a dataset. All four datasets have nearly identical means, variances, and correlations between x and y, yet they look very different when plotted.
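The point about Anscombe's Quartet can be checked numerically. The sketch below hard-codes the commonly published quartet values and computes the mean, sample variance, and Pearson correlation for each set using only the standard library. The variable names and the `pearson` helper are illustrative choices.

```python
import math
import statistics

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Collect (mean_x, var_x, mean_y, var_y, r) for each of the four sets.
quartet_stats = {}
for name, (xs, ys) in ANSCOMBE.items():
    quartet_stats[name] = (
        statistics.mean(xs),
        statistics.variance(xs),
        statistics.mean(ys),
        statistics.variance(ys),
        pearson(xs, ys),
    )
    mean_x, var_x, mean_y, var_y, r = quartet_stats[name]
    print(f"{name:>3}: mean_x={mean_x:.2f} var_x={var_x:.2f} "
          f"mean_y={mean_y:.2f} var_y={var_y:.2f} r={r:.3f}")
```

The printed rows agree to about two decimal places, so only a plot of each set reveals how different they are.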

¹⁷(Ng, The Batch: Issue 84)

Recommendation

Match model complexity to volume and variability of the data

Use a level of model capacity appropriate to the complexity of the task. As the phenomenon being modeled becomes more complex, increase capacity by adding features or choosing more flexible algorithms. If the phenomenon is less complex¹⁹ or the dataset is small, use simpler, higher-bias methods to avoid overfitting.²⁰
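One way to put this into practice is to compare models of different capacity on held-out data and keep the one that generalizes best. The sketch below is an illustration under stated assumptions, not a prescribed method: it uses a k-nearest-neighbor regressor (small k means high capacity, large k means more bias) and leave-one-out cross-validation on a small synthetic dataset. The data-generating function, the noise level, and the candidate values of k are all made up for the example.

```python
import random

def knn_predict(train, x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def loo_mse(data, k):
    """Leave-one-out mean squared error of a k-NN regressor."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out observation i
        total += (knn_predict(train, x, k) - y) ** 2
    return total / len(data)

# Small synthetic dataset: a simple linear trend plus Gaussian noise.
rng = random.Random(0)
data = [(x, 2.0 * x + rng.gauss(0, 3.0)) for x in (i * 0.5 for i in range(30))]

# Evaluate several capacity levels and keep the one with the lowest error.
errors = {k: loo_mse(data, k) for k in (1, 3, 5, 9, 15)}
best_k = min(errors, key=errors.get)

for k, err in errors.items():
    print(f"k={k:>2}  leave-one-out MSE={err:.2f}")
print("selected k:", best_k)
```

The same loop works with any family of models ordered by capacity, such as polynomial degree or tree depth.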

¹⁹Although if the task or phenomenon isn’t complex you should probably reconsider whether you should be using machine learning—along with your life choices—because somewhere a statistician is crying.
²⁰(Goodfellow, Bengio and Courville 110)

Recommendation

Employ strategic data collection and labeling in your sourcing

According to a 2019 survey, 96% of enterprises encounter data quality and labeling challenges.²¹ Creating well-defined labels is critical for preventing edge-case ambiguity, particularly when using internal labelers or outsourcing to a lower-quality external labeler.

If it’s not prohibitively expensive, data teams should use multiple labelers for each observation. Using multiple labelers smooths noisiness in labeling because majority opinion will be stable in the aggregate.
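A minimal sketch of majority-vote aggregation follows. The item IDs, labels, and the `majority_label` helper are hypothetical. Ties are resolved in favor of the label that appears first in the vote list, which is a simplification.

```python
from collections import Counter

def majority_label(votes):
    """Return (label, agreement), where agreement is the winning share of votes."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)

# Hypothetical annotations: three labelers per item.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "cat", "bird"],
}

consensus = {item: majority_label(votes) for item, votes in annotations.items()}

for item, (label, agreement) in consensus.items():
    print(f"{item}: {label} (agreement {agreement:.2f})")
```

A low agreement value, as for `img_003`, signals that the consensus label should not be trusted without review.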

Cases where labelers disagree also show where labeling parameters can be made more explicit. Data teams should track the sources of labels and should look for opportunities to improve performance by reducing the weight of lower-quality labels.
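The sketch below illustrates both ideas under stated assumptions: each vote is tagged with its source, sources carry a reliability weight, and items whose winning label holds a small share of the total weight are flagged for review. The source names, weights, and the 0.75 review threshold are hypothetical values chosen for the example.

```python
from collections import defaultdict

def weighted_vote(votes, weights):
    """votes: list of (source, label). Returns (label, share of total weight)."""
    totals = defaultdict(float)
    for source, label in votes:
        totals[label] += weights.get(source, 1.0)  # unknown sources get weight 1.0
    label = max(totals, key=totals.get)
    return label, totals[label] / sum(totals.values())

# Hypothetical reliability weights per label source.
weights = {"expert": 1.0, "vendor_a": 0.5, "vendor_b": 0.3}

votes = {
    "doc_1": [("expert", "spam"), ("vendor_a", "spam"), ("vendor_b", "spam")],
    "doc_2": [("expert", "ham"), ("vendor_a", "spam"), ("vendor_b", "spam")],
}

results = {item: weighted_vote(v, weights) for item, v in votes.items()}

# Items with weak consensus point to labeling rules that need tightening.
REVIEW_THRESHOLD = 0.75  # assumed cutoff
needs_review = [item for item, (_, share) in results.items() if share < REVIEW_THRESHOLD]

print(results)
print("needs review:", needs_review)
```

Here the expert's vote on `doc_2` outweighs the two vendor votes combined, and the item is still flagged because the consensus is weak.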

Bias often occurs at the point of data collection. While it can be difficult to identify, there are some strategies to mitigate its influence.

Labeling data limits the feasibility of ML projects. Labeling poses challenges due to the inaccuracy inherent to human judgement and the expense of acquiring enough high-quality labels. While no adequate automated solution exists, some techniques can be used to make the process more efficient and less expensive, even if expensive subject matter experts (SMEs) are required.²²

²¹(Dimensional Research)

²²E.g., doctors needed to label disease in medical imaging datasets
²³(Burkov, The Hundred-Page Machine Learning Book 91)
²⁴(Burkov, The Hundred-Page Machine Learning Book 91)
²⁵(Burkov, The Hundred-Page Machine Learning Book 102 - 103)