Overview
Synthetic data is used when real data is scarce, expensive, or too sensitive to use (e.g., medical or financial records).
How it's Created
- Generative Models: Using VAEs, GANs, or diffusion models to learn the distribution of real data, then sampling new data points from it.
- Simulators: Using physics engines or rule-based systems to generate data (common in robotics and autonomous driving).
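Both approaches can be illustrated with a toy sketch. This is not how VAEs, GANs, or production simulators are built: here the "generative model" is just a 1-D Gaussian fitted to a handful of made-up measurements, and the "simulator" is the free-fall formula t = sqrt(2h/g) with added sensor noise. The function names, sample values, and noise level are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def fit_and_sample(real_values, n_samples, seed=0):
    """Toy generative model: fit a 1-D Gaussian to real data, then sample from it."""
    rng = random.Random(seed)
    mu = statistics.fmean(real_values)      # learned mean
    sigma = statistics.stdev(real_values)   # learned standard deviation
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]


def simulate_drop_times(heights_m, noise_std_s=0.01, seed=0):
    """Toy simulator: free-fall time t = sqrt(2h/g) plus Gaussian sensor noise."""
    rng = random.Random(seed)
    g = 9.81  # gravitational acceleration in m/s^2
    return [
        (h, math.sqrt(2 * h / g) + rng.gauss(0, noise_std_s))
        for h in heights_m
    ]


# Hypothetical "real" measurements; in practice these come from the scarce dataset.
real = [4.9, 5.1, 5.0, 5.3, 4.7, 5.2]
synthetic = fit_and_sample(real, n_samples=1000)

# The simulator needs no real data at all, only the rules of the system.
labelled = simulate_drop_times([1.0, 2.0, 5.0])
```

The generative route needs real data to learn from, while the simulator route needs only a model of the underlying process. That difference is one reason simulators are attractive where real data is hard to collect.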
Benefits
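- Privacy: Reduces the risk of leaking personal information, since synthetic records need not correspond to any real person. It does not eliminate that risk: a generative model can memorize and reproduce training records unless safeguards such as differential privacy are applied.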
- Scale: Can generate millions of examples, often at a fraction of the cost of manual collection and labeling.
- Bias Correction: Can be used to rebalance datasets by generating more examples of underrepresented groups or classes.
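The rebalancing idea can be sketched as follows. This minimal example tops up each minority class by interpolating between random pairs of its existing rows. It is a simplified, SMOTE-style approach (real SMOTE interpolates between nearest neighbors), and the function name and the tiny dataset are hypothetical.

```python
import random
from collections import Counter


def balance_by_interpolation(rows, labels, seed=0):
    """Add synthetic rows until every class matches the largest class in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Existing examples of this class, used as interpolation endpoints.
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            a, b = rng.choice(pool), rng.choice(pool)
            t = rng.random()  # position along the segment from a to b
            new_rows.append([x + t * (y - x) for x, y in zip(a, b)])
            new_labels.append(label)
    return new_rows, new_labels


# Hypothetical imbalanced dataset: four rows of class "a", two of class "b".
rows = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.2], [5.0, 6.0], [5.4, 6.2]]
labels = ["a", "a", "a", "a", "b", "b"]
balanced_rows, balanced_labels = balance_by_interpolation(rows, labels)
```

Interpolated rows stay within the range of the existing minority examples, so this balances class counts but cannot add variation the original data never contained.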