According to a 2019 survey, 96% of enterprises encounter data quality and labeling challenges.21 Creating well-defined labels is critical for preventing edge-case ambiguity, particularly when using internal labelers or outsourcing to a lower-quality external labeler.
If it’s not prohibitively expensive, data teams should use multiple labelers for each observation. Using multiple labelers smooths out labeling noise: individual errors tend to be outvoted, so the majority opinion is more stable in the aggregate.
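As a minimal sketch of how majority opinion can be computed, the Python snippet below aggregates several labelers' votes per observation. The observation IDs, label values, and the `majority_label` helper are hypothetical and chosen only for illustration; treating ties as unresolved is one possible convention, not a prescribed one.

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label, or None if empty or tied for first place."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: leave unresolved rather than pick arbitrarily
    return counts[0][0]

# Hypothetical annotations: observation ID -> labels from different labelers.
annotations = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog"],
}

consensus = {obs: majority_label(votes) for obs, votes in annotations.items()}
print(consensus)
```

Using an odd number of labelers per observation reduces the chance of ties for binary labels; tied observations can be routed to an additional labeler.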
Using multiple labelers also surfaces cases of disagreement, which indicate where labeling parameters can be made more explicit. Data teams should track the source of each label and look for opportunities to improve performance by reducing the weight of lower-quality labels.
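The two ideas above, flagging disagreement and down-weighting lower-quality sources, might be sketched in Python as follows. The source names, weights, review threshold, and helper functions are all assumptions made for illustration; in practice, weights would need to be estimated, for example from each source's accuracy on a trusted subset.

```python
from collections import Counter, defaultdict

def agreement(labels):
    """Fraction of votes that match the most common label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)

def weighted_label(votes, source_weights, default_weight=1.0):
    """Pick the label with the largest total weight across (source, label) votes."""
    totals = defaultdict(float)
    for source, label in votes:
        totals[label] += source_weights.get(source, default_weight)
    return max(totals, key=totals.get)

# Hypothetical weights: the external vendor's labels count for less.
source_weights = {"in_house": 1.0, "vendor": 0.4}

# Each vote records where the label came from.
votes = [("in_house", "spam"), ("vendor", "ham"), ("vendor", "ham")]

label = weighted_label(votes, source_weights)
score = agreement([lab for _, lab in votes])
needs_review = score < 0.75  # assumed threshold for revisiting label definitions

print(label, round(score, 2), needs_review)
```

Here the weighted vote sides with the in-house labeler even though two vendor labels disagree, and the low agreement score flags the observation as a candidate for tightening the label definitions.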
Bias often occurs at the point of data collection. While it can be difficult to identify, there are some strategies to mitigate its influence.