There's a hidden killer in your data.
Silently waiting for you to train your machine learning model. Deceiving you even in your validation data with great performance. Cleverly hiding from your statistical tests and automated procedures.
Suddenly, when you release your fledgling model into the real world, it strikes!
Data leakage: the silent killer
Your data set somehow had leaky information.
Information that the model found and used directly to predict the target with high accuracy. But once the model encounters real-world data, this information either isn't available or isn't as reliable a predictor!
How, you say?
Static datasets are snapshots
Your training dataset was created somehow. (Well, duh)
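How you create the dataset can introduce leaky variables: columns that give away the target but won't be there, or won't behave the same way, at prediction time. But there are other ways to leak information, too.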
Let's look at examples of a leaky variable first:
- The Statoil iceberg prediction challenge contained the satellite angle, which could be exploited to predict icebergs vs. ships.
- In the deepfake competition, the generated videos did not contain aspect ratio metadata, so the missing metadata itself could give the fakes away.
- A hackathon dataset I used contained an obscure column holding predictions of the target, made by a model from a different competition.
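One cheap sanity check for columns like these is to ask how well each feature predicts the target all by itself. Here's a minimal sketch, assuming a binary target and numeric columns. The column names, the toy data, and the 0.95 threshold are all made up for illustration, and a check like this will only catch the blatant leaks, not the subtle ones.

```python
import random


def single_feature_auc(values, labels):
    """AUC when one feature is used directly as the score.

    Equals the probability that a random positive example has a higher
    value than a random negative example (ties count as half).
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def flag_suspicious_columns(columns, labels, threshold=0.95):
    """Return columns that separate the classes almost perfectly on their own."""
    flagged = {}
    for name, values in columns.items():
        auc = single_feature_auc(values, labels)
        # An AUC near 0 is just as suspicious as one near 1 (inverted signal).
        strength = max(auc, 1.0 - auc)
        if strength >= threshold:
            flagged[name] = strength
    return flagged


# Toy data: the hypothetical "leaky_prediction" column is basically the target.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
columns = {
    "noise": [rng.random() for _ in labels],
    "weak_signal": [y + rng.gauss(0, 2.0) for y in labels],
    "leaky_prediction": [y * 0.9 + rng.random() * 0.05 for y in labels],
}

flagged = flag_suspicious_columns(columns, labels)
print(flagged)
```

A flagged column isn't automatically a leak, but it is a column you should be able to explain before you trust it.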
Your test data is sacrosanct
But it gets worse.
I just worked on predicting some bitcoin prices, and good validation saved me. Fitting the normalization on the training set only showed that XGBoost failed completely on the test set once prices surpassed the maximum price in the training data. That makes sense: tree-based models can't extrapolate beyond the range of values they were trained on.
If I had leaked the maximum of the test set into the normalization instead, I would've gone ahead with XGBoost as the best and fastest model, all because of test data leakage!
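To make the difference concrete, here's a small sketch of min-max normalization done both ways. The prices are invented and the helper functions are hypothetical stand-ins rather than my actual pipeline; the only point is where the minimum and maximum come from.

```python
def fit_min_max(values):
    """Learn the scaling parameters from the given values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with previously fitted parameters."""
    return [(v - lo) / (hi - lo) for v in values]


# Made-up prices in chronological order; the test period trends higher.
prices = [100.0, 120.0, 150.0, 180.0, 200.0, 260.0, 320.0, 400.0]
train, test = prices[:5], prices[5:]

# Correct: fit on the training data only, then reuse those parameters.
lo, hi = fit_min_max(train)
test_scaled = apply_min_max(test, lo, hi)

# Leaky: the test maximum sneaks into the scaling parameters.
leaky_lo, leaky_hi = fit_min_max(prices)
test_scaled_leaky = apply_min_max(test, leaky_lo, leaky_hi)

print("train-only fit:", test_scaled)
print("leaky fit:     ", test_scaled_leaky)
```

With the train-only fit, every scaled test value lands above 1, so it's obvious that the test period lies outside anything the model has seen. With the leaky fit, everything sits neatly inside the 0 to 1 range, and that warning sign disappears.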
Beware of data leakage! It can hide anywhere!