Opportunity

Data quality makes impactful projects feasible

Removing the feasibility bottleneck

To become a source of competitive advantage, an artificial intelligence (AI) or machine learning (ML) based project must be both impactful and feasible. Feasibility imposes economic and strategic limits on AI and ML projects. Improving data quality is extremely powerful because it can make new use cases feasible.

Figure 1 Deciding which ML projects to invest in and/or prioritize depends on impact and feasibility.¹

¹(Karayev)

Does more data always beat better algorithms?

In 2009, Google’s Director of Research, Peter Norvig,² explained why Google’s search engine was so effective: “We don’t have better algorithms. We just have more data.”³ This quote describes why Google has been so successful, but it also highlights an often-forgotten truth about the history of deep learning: advances have come largely from the data, not the algorithms.

Deep learning started gaining traction in 2006, and dramatic improvements in accuracy have since made increasingly complex tasks achievable. But deep learning dates back to the 1940s (when it was known as cybernetics), and modern algorithms aren’t very different from those of the 1980s (then known as connectionism). The most important factor driving the modern success of deep learning is the massive amount of data that is now tracked and available due to digitization.⁴

Massive data isn’t always available, or inexpensive enough for commercial applications. This makes deep learning infeasible for many use cases: as a rule of thumb, a supervised neural network needs around 5,000 labeled examples per category to reach acceptable performance, and a dataset of at least 10 million labeled examples to match or exceed human performance.⁶

So why focus on data quality rather than more data? To understand why data quality is often better, let’s look at the nature of the models and the business context for machine learning.

²Peter Norvig is Google’s director of research and former director of search quality. He co-wrote Artificial Intelligence: A Modern Approach, a foundational text in the field.
³(Halevy, Norvig and Pereira)
⁴(Goodfellow, Bengio and Courville 12-15, 18-21, 24-25)
⁵Representation capacity measures a model’s ability to fit a wide variety of functional forms (Goodfellow, Bengio and Courville 19).
⁶(Goodfellow, Bengio and Courville 20-21)

Figure 2 Increasing benchmark dataset size over time.⁵

The nature of data and models: the bias-variance tradeoff

Improvements in data quality help to combat the bias-variance tradeoff. This is because cleaner data allows data teams to do more with less. Let’s consider why.

In statistical learning theory, observed data results from some unknown data generating process. When data is gathered, patterns can be modeled to approximate the data generating process via induction.

Inductively generated models can never be completely accurate, due to error introduced from bias and variance. Error from bias occurs when a model has low representation capacity and cannot represent complex relationships in the data. This is known as underfitting, since the model lacks the flexibility to fit large variations in the data.

Data teams can increase representation capacity by introducing more variables or by using more complex algorithms. In doing so, they can, in theory, make a model flexible enough to fit relationships of almost any complexity.

But in the real world, all data contains noise from measurement error. Noise introduces error from variance. When a data team makes a model more complex so that it can accommodate more variables, they also make it more likely to fit variations that are due to noise. This results in overfitting: because the model’s predictions are based significantly on noise, the model does not generalize to new data.
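The effect described above can be seen in a small simulation. The following is a minimal sketch, not a method from this paper: it uses k-nearest-neighbor averaging as a stand-in for model complexity (a smaller k means a more flexible model), and the sine-shaped signal, noise level, sample sizes, and function names are all assumptions chosen for illustration.

```python
import math
import random


def true_signal(x):
    # Hypothetical data-generating process, chosen only for illustration.
    return math.sin(2 * math.pi * x)


def make_data(n, noise_sd, rng):
    # Observed y = signal + measurement noise.
    xs = [rng.random() for _ in range(n)]
    ys = [true_signal(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x, k):
    # Average the targets of the k training points closest to x.
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, eval_x, eval_y, k):
    # Mean squared error of the k-NN model on an evaluation set.
    errors = [
        (knn_predict(train_x, train_y, x, k) - y) ** 2
        for x, y in zip(eval_x, eval_y)
    ]
    return sum(errors) / len(errors)


rng = random.Random(0)
train_x, train_y = make_data(400, 0.3, rng)
test_x, test_y = make_data(400, 0.3, rng)

# k=1: very flexible; k=20: moderate; k=400: one global average.
results = {}
for k in (1, 20, 400):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_err, test_err)
    print(f"k={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

In this setup the most flexible model (k = 1) reproduces its training data exactly, noise included, while the least flexible one (k equal to the training-set size) predicts a single average everywhere. An intermediate k would be expected to do best on held-out data.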

The bias-variance tradeoff cannot be avoided because a portion of the error is irreducible.⁷ But data teams can minimize variance by thoroughly cleaning the data. When the data has been thoroughly cleaned, the team can effectively deploy a more accurate model than the size of the dataset would normally permit.
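For reference, the standard decomposition of expected squared error makes the three sources of error explicit. This is a sketch under stated assumptions: squared-error loss, and observations of the form y = f(x) + ε with zero-mean noise ε of variance σ² that is independent of the training sample. The symbols f, f̂, ε, and σ are notation introduced here, not taken from the text.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Here the expectation is taken over training samples and noise. The variance term typically shrinks when the training data is less noisy, since the fitted model has less noise to chase.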

⁷If the process is stochastic in nature, this error is irreducible because some portion is truly random and fits no pattern or relationship. If the process is deterministic, the error is still irreducible because measurement instruments cannot be totally precise. There are theoretical debates as to what type of process governs the universe, but this only matters if you have material business in existential markets.

Business context: to finity and beyond when more data isn’t an option

Data quality is paramount to developing feasible models with practical applications in the real world. Why?

A theoretical approach might try to optimize the bias-variance tradeoff by training the most complex algorithm on massive datasets. But this only works if you have access to infinite data. In the real world, datasets are finite. Most companies aren’t Google, and don’t have nearly enough data to use the most complex algorithms available. Even if there is a large dataset, the class of interest may represent a small proportion of it, which leads to challenges similar to those of a small dataset.⁸
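A quick calculation shows how class imbalance shrinks a large dataset. This is a rough sketch only: the 1-in-1,000 fraud rate and the one-million-row dataset are hypothetical numbers, the function names are made up for illustration, and the 5,000-examples-per-category threshold is the rule of thumb quoted earlier in this document.

```python
import math
from fractions import Fraction

# Rule of thumb quoted earlier in the text (labeled examples per category).
LABELS_PER_CATEGORY = 5_000


def minority_examples(total_rows: int, prevalence: Fraction) -> int:
    # Expected number of rows that belong to the class of interest.
    return math.floor(total_rows * prevalence)


def rows_needed(prevalence: Fraction, target: int = LABELS_PER_CATEGORY) -> int:
    # Total rows required to expect `target` examples of the rare class.
    return math.ceil(target / prevalence)


fraud_rate = Fraction(1, 1000)  # hypothetical: 1 fraudulent transaction in 1,000
available = minority_examples(1_000_000, fraud_rate)
required = rows_needed(fraud_rate)

print(f"rare-class examples in 1,000,000 rows: {available}")
print(f"rows needed for {LABELS_PER_CATEGORY} rare-class examples: {required}")
```

Under these assumed numbers, a million-row dataset behaves like a 1,000-example dataset for the class that matters, and reaching the rule-of-thumb threshold would take five times as many rows.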

Even if an organization can obtain more data, the cost of acquiring it could outweigh the performance benefits.

Data teams can create a stronger signal with less data by cleaning noisy labels and correcting errors. Cleaning can also increase the effective size of a dataset by recovering data points that would otherwise be unusable due to corruption. Improving data quality also helps teams avoid errors caused by biased data.
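The claim that clean labels can outweigh volume can be illustrated with a toy experiment. This is a hedged sketch, not a result from this paper: the one-dimensional task, the 20% label-flip rate, the dataset sizes, and the 1-nearest-neighbor classifier (a deliberately noise-sensitive learner) are all assumptions, and "cleaning" is simulated simply by generating labels without flips.

```python
import random


def make_labeled(n, flip_prob, rng):
    # x is uniform on [0, 1]; the true label is 1 when x > 0.5.
    # Each label is flipped with probability flip_prob to simulate label noise.
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < flip_prob:
            label = 1 - label
        data.append((x, label))
    return data


def predict_1nn(train, x):
    # Return the label of the single closest training point.
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, test):
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test)


rng = random.Random(0)
test_set = make_labeled(500, 0.0, rng)      # evaluated against true labels
noisy_large = make_labeled(2000, 0.2, rng)  # 4x the data, 20% wrong labels
clean_small = make_labeled(500, 0.0, rng)   # less data, correct labels

acc_noisy = accuracy(noisy_large, test_set)
acc_clean = accuracy(clean_small, test_set)
print(f"2000 noisy labels: accuracy = {acc_noisy:.3f}")
print(f" 500 clean labels: accuracy = {acc_clean:.3f}")
```

The size of the gap depends heavily on the learner and the kind of noise. A model that averages over many examples would be more robust to random label flips than the 1-nearest-neighbor rule used here.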

Good data increases the number of impactful goals ML can feasibly achieve, particularly where volume is limited.¹⁰

Figure 3 McKinsey forecast of potential impact of artificial intelligence by industry through 2030.⁹

⁸E.g., in medical imaging, most images will be of healthy patients; in fraud detection, most transactions will be legitimate.
⁹Note that most new value will be in fields outside of tech, where labeling and dataset size will be significant bottlenecks, especially in industries such as healthcare, where expert annotation can be expensive.
¹⁰(Ng, A Chat with Andrew on MLOps: From Model-centric to Data-centric AI)