Overview
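The quality and diversity of training data are among the most important factors in an AI model's performance, because a model can only learn patterns that are present in its data. 'Garbage in, garbage out' is a long-standing maxim in computing that applies directly to AI development.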
Types
- Labeled Data: Used in supervised learning (e.g., images with tags).
- Unlabeled Data: Used in self-supervised learning, where the training signal is derived from the data itself (e.g., raw text from the internet).
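
The difference between the two types can be illustrated with a minimal sketch. The file names, labels, and the helper next_word_pairs are hypothetical and chosen only for illustration; real systems operate on tokens and much larger corpora, but the idea is the same: labeled data carries an externally supplied target, while self-supervised training builds its targets from the raw data.

```python
# Labeled data: each example pairs an input with a human-provided label.
# (File names and labels are made up for illustration.)
labeled_examples = [
    ("photo_001.jpg", "cat"),
    ("photo_002.jpg", "dog"),
]


def next_word_pairs(text):
    """Turn raw text into (context, target) pairs for next-word prediction.

    No human labels are needed: the target is simply the word that
    follows each context in the original text.
    """
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


# Unlabeled data: raw text, from which training targets are derived.
pairs = next_word_pairs("the cat sat down")
for context, target in pairs:
    print(context, "->", target)
```

Each sentence of n words yields n - 1 training pairs without any annotation effort, which is one reason unlabeled data can be collected at much larger scale than labeled data.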
Challenges
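- Bias: A model tends to reproduce, and can amplify, biases present in its training data (e.g., a group that is underrepresented or disproportionately given one label).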
- Copyright: Using web-scraped data raises legal issues, because much of the scraped content is protected by copyright.
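
As a rough illustration of the bias challenge, one simple first check is to compare how often each group receives the positive label in a dataset. This is a sketch under stated assumptions, not a complete fairness audit: the group names, the records, and the helper positive_rate_by_group are all hypothetical.

```python
from collections import defaultdict


def positive_rate_by_group(records):
    """Return the fraction of positive labels (1) for each group.

    `records` is an iterable of (group, label) pairs with label in {0, 1}.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, label in records:
        totals[group] += 1
        positives[group] += label
    return {group: positives[group] / totals[group] for group in totals}


# Hypothetical toy dataset: group "A" is labeled positive far more often.
records = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 0), ("B", 0), ("B", 1), ("B", 0),
]

rates = positive_rate_by_group(records)
print(rates)
```

A large gap between groups does not prove the data is unfair, since it may reflect a real difference, but it flags a pattern that a model trained on this data is likely to learn and that deserves a closer look.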