Overview

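Synthetic data is artificially generated data that mimics the statistical properties of real data. It is used when real data is scarce, expensive to collect, or too sensitive to share (e.g., medical or financial records).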

How It's Created

  • Generative Models: Using variational autoencoders (VAEs), generative adversarial networks (GANs), or diffusion models to learn the distribution of real data and sample new points from it.
  • Simulators: Using physics engines or rule-based systems to generate data (common in robotics and autonomous driving).
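Both approaches can be illustrated with a minimal sketch, assuming a toy setting. A fitted normal distribution stands in for a real generative model (a VAE, GAN, or diffusion model learns a far richer distribution), and a free-fall formula with sensor noise stands in for a physics engine. The dataset, the function name simulate_drop, and all parameter values are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist

rng = random.Random(0)

# --- Generative approach: learn a distribution, then sample from it ---
# Stand-in for a small "real" dataset (hypothetical sensor readings).
real = [rng.gauss(50.0, 5.0) for _ in range(200)]

# "Learn" the distribution by estimating its mean and standard deviation.
fitted = NormalDist.from_samples(real)

# Draw new synthetic points from the fitted distribution.
synthetic = fitted.samples(1000, seed=1)


# --- Simulator approach: generate data from known rules ---
def simulate_drop(h0, dt, noise_sd, rng):
    """Return noisy (time, height) readings for an object dropped from h0 metres."""
    g = 9.81  # gravitational acceleration in m/s^2
    readings = []
    i = 0
    while True:
        t = i * dt
        height = h0 - 0.5 * g * t * t  # free-fall kinematics
        if height < 0:
            break  # the object has reached the ground
        # Add Gaussian noise to mimic an imperfect sensor.
        readings.append((t, height + rng.gauss(0.0, noise_sd)))
        i += 1
    return readings


trajectory = simulate_drop(h0=10.0, dt=0.1, noise_sd=0.05, rng=rng)
print(len(synthetic), len(trajectory))
```

The generative path needs real data to fit but no knowledge of how that data arises. The simulator path needs no real data but does require a trustworthy model of the underlying process.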

Benefits

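  • Privacy: Reduces the risk of leaking personal information, although a generative model can memorize and reproduce real records, so the risk is not eliminated entirely.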
  • Scale: Can generate millions of examples for a fraction of the cost of manual collection.
  • Bias Correction: Can be used to 'balance' datasets by generating more examples of underrepresented groups.
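The balancing idea can be sketched as follows, assuming numeric feature vectors and class labels. New minority examples are created by interpolating between random pairs of existing examples of the same class. This is a simplified relative of SMOTE, which interpolates toward nearest neighbours rather than random pairs. The function name oversample_minority and the toy data are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(points, labels, rng):
    """Add interpolated synthetic points until every class matches the largest one."""
    counts = Counter(labels)
    target = max(counts.values())
    new_points, new_labels = list(points), list(labels)
    for label, count in counts.items():
        members = [p for p, l in zip(points, labels) if l == label]
        for _ in range(target - count):
            # Pick two existing examples of this class and blend them.
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            new_points.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
            new_labels.append(label)
    return new_points, new_labels


# Toy imbalanced dataset: four examples of class "a", two of class "b".
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "a", "b", "b"]

balanced_points, balanced_labels = oversample_minority(points, labels, random.Random(0))
print(Counter(balanced_labels))
```

Interpolation only fills in the region already covered by the minority examples, so it cannot add information the original data lacks. Balanced class counts alone do not guarantee that a model trained on the result is unbiased.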

Related Terms