Biased Training Dataset
- Biased data distribution
This occurs when your training data is not truly representative of the population your product seeks to serve. Think carefully about how your data was collected. For example, if you have a dataset of user-submitted photos and you filter it for image clarity, this could skew your data by over-representing users who have expensive cameras. In general, consider how your data is distributed with respect to the groups of users your product will serve. Do you have enough data for each relevant group? There are often subtle, systemic reasons why your dataset might not capture the full diversity of your use case in the real world.
To mitigate this, you could acquire data from multiple sources, or filter your data carefully so that you keep only the most useful examples from overrepresented groups.
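As a rough illustration of this kind of distribution audit, the sketch below counts examples per group and downsamples overrepresented groups to the size of the smallest one. The `group` field and the dict-per-example format are assumptions for the sake of the example, not a prescribed schema, and downsampling to the minimum is only one of several rebalancing strategies.

```python
import random
from collections import Counter

def audit_and_downsample(examples, group_key, seed=0):
    """Report per-group counts, then downsample overrepresented
    groups to the size of the smallest group.

    `examples` is a list of dicts; `group_key` names the
    (hypothetical) field identifying each example's group.
    """
    counts = Counter(ex[group_key] for ex in examples)
    target = min(counts.values())  # size of the rarest group
    rng = random.Random(seed)
    balanced = []
    for group in counts:
        members = [ex for ex in examples if ex[group_key] == group]
        # Sample without replacement down to the target size.
        balanced.extend(rng.sample(members, target))
    return counts, balanced

# Toy dataset: group "A" is heavily overrepresented.
data = (
    [{"group": "A", "img": i} for i in range(8)]
    + [{"group": "B", "img": i} for i in range(3)]
)
counts, balanced = audit_and_downsample(data, "group")
```

Printing `counts` before rebalancing is often the more valuable step: it surfaces the skew so you can decide whether to collect more data for the rare groups rather than throw data away.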
- Biased data representation
It's possible that you have an appropriate amount of data for every demographic group you can think of, but that some groups are represented less positively than others. Consider a dataset of microblog posts about actors. It might be the case that you did a great job collecting a 50-50 split of male and female performers, but when you dig into the content, posts about female performers tend to be more negative than those about male performers. This could lead your model to learn some form of gender bias. For some applications, however, different representations between groups may not be a problem. In medical classification, for instance, it's important to capture subtle demographic differences to make more accurate diagnoses. But for other applications, biased negative associations may have financial or educational repercussions, limit economic opportunity, and cause emotional and mental anguish.
Consider hand-reviewing your data for these negative associations if it's feasible, or applying rule-based filters to remove negative representations if you think it's right for your application.
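One lightweight way to surface the skew described above is to score each post's sentiment and compare the averages per group. The keyword scorer below is a deliberately crude stand-in for a real sentiment model, and the `(group, text)` pair format is an assumption for illustration only.

```python
from collections import defaultdict

# Toy keyword lists standing in for a real sentiment model.
NEGATIVE = {"terrible", "awful", "overrated"}
POSITIVE = {"brilliant", "great", "stunning"}

def toy_sentiment(text):
    """Crude score: +1 per positive keyword, -1 per negative one."""
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def mean_sentiment_by_group(posts):
    """posts: list of (group, text) pairs (hypothetical format)."""
    scores = defaultdict(list)
    for group, text in posts:
        scores[group].append(toy_sentiment(text))
    return {g: sum(s) / len(s) for g, s in scores.items()}

posts = [
    ("male", "a brilliant performance"),
    ("male", "great in this role"),
    ("female", "honestly overrated"),
    ("female", "a stunning turn"),
]
by_group = mean_sentiment_by_group(posts)
```

A persistent gap between groups in this kind of summary is a signal to hand-review the underlying posts, not proof of bias on its own: the scorer itself can carry bias, which is worth validating first.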
- Biased labels
An essential step in creating training data for AutoML is labeling your data with relevant categories. Minimizing bias in these labels is just as important as ensuring your data is representative. Understand who your labelers are. Where are they located? What languages do they speak natively? What ages and genders are they? Homogeneous rater pools can yield labels that are incorrect or skewed in ways that might not be immediately obvious.
Ideally, make sure your labelers are experts in your domain or give instructions to train them on relevant aspects, and have a secondary review process in place to spot-check label quality. Aim to optimize for objectivity over subjectivity in decision-making. Training labelers on “unconscious bias” has also been shown to help improve the quality of labels with respect to diversity goals. Finally, allowing labelers to self-report issues and ask clarifying questions about instructions can also help minimize bias in the labeling process.
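The secondary review process mentioned above can start as simply as measuring how often a second reviewer agrees with the primary label on a spot-checked sample. The sketch below computes that agreement rate and collects the disagreements for follow-up; the label values and list-based format are illustrative assumptions.

```python
def review_labels(primary, review):
    """Compare primary labels against a secondary reviewer's
    labels on the same spot-checked items.

    Returns the agreement rate and the indices where the two
    disagree, so those items can be escalated or re-labeled.
    """
    if len(primary) != len(review):
        raise ValueError("label lists must cover the same items")
    disagreements = [
        i for i, (p, r) in enumerate(zip(primary, review)) if p != r
    ]
    rate = 1 - len(disagreements) / len(primary)
    return rate, disagreements

# Spot-check sample: reviewer disagrees on item 2.
primary = ["cat", "dog", "cat", "bird", "dog"]
review = ["cat", "dog", "dog", "bird", "dog"]
rate, flagged = review_labels(primary, review)
```

A low agreement rate usually points to ambiguous instructions rather than careless labelers, so routing the flagged items back into clarified guidelines tends to pay off more than re-labeling alone.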