Overview
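Data leakage (of which 'target leakage' is one common form) happens when information that would not be available at prediction time, often the answer itself or a proxy for it, reaches the model during training. This makes the model look unrealistically good in testing but causes it to perform much worse in the real world.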
Common Causes
- Future Information: Including features that wouldn't be available at the time of prediction (e.g., using 'hospital discharge date' to predict 'length of stay').
- Train-Test Contamination: Accidentally including the same data points in both the training and testing sets.
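- Preprocessing Errors: Calculating statistics such as the mean or variance on the entire dataset (for scaling or imputation, for example) before splitting it into train/test sets, so that information from the test set shapes the training data.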
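The preprocessing error in the last bullet can be illustrated with a minimal sketch using only the Python standard library. The helper names (`fit_standardizer`, `standardize`) and the toy values are assumptions made for illustration, not part of any particular library. The sketch contrasts statistics computed on train and test together (leaky) with statistics computed on the training set alone and then reused for the test set.

```python
import statistics


def fit_standardizer(values):
    """Return (mean, std) computed from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return mean, std


def standardize(values, mean, std):
    """Scale values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


# Toy feature values (illustrative only).
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0, 200.0]

# Leaky: the test values influence the statistics used on the training data.
leaky_mean, leaky_std = fit_standardizer(train + test)

# Leak-free: fit on the training set only, then reuse those statistics.
train_mean, train_std = fit_standardizer(train)
train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)
```

With the leak-free version, the test set is transformed exactly the way unseen production data would be, so the evaluation reflects real-world conditions more closely.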
Prevention
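Strict data splitting and careful feature engineering are essential to avoid leakage: split the data before any preprocessing, fit preprocessing statistics on the training set only, and keep only features that would be available at prediction time.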
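As a rough sketch of what strict splitting and careful feature selection might look like, the following standard-library example drops exact duplicate records before splitting and filters out features that would not exist at prediction time. The allow-list `AVAILABLE_AT_PREDICTION`, the helper names, and the sample records are hypothetical and chosen only for illustration. Real projects may also need group-aware or time-based splits.

```python
import random

# Hypothetical allow-list: features known at the moment a prediction is made.
AVAILABLE_AT_PREDICTION = {"age", "admission_date"}


def keep_available_features(row):
    """Drop any feature that would not exist yet at prediction time."""
    return {k: v for k, v in row.items() if k in AVAILABLE_AT_PREDICTION}


def split_without_overlap(records, test_fraction=0.25, seed=0):
    """Remove exact duplicates, shuffle, and split into (train, test)."""
    unique = list(dict.fromkeys(records))  # records must be hashable
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]


# A row containing a feature from the future (the discharge date).
row = {"age": 70, "admission_date": "2024-01-03", "discharge_date": "2024-01-09"}
features = keep_available_features(row)

# Toy records with duplicates that could otherwise land in both sets.
records = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("d", 4), ("b", 2)]
train, test = split_without_overlap(records)
overlap = set(train) & set(test)  # empty when the split is clean
```

Checking that `overlap` is empty is a cheap guard to run every time the split is regenerated.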