Machine Learning Data Leakage
A Machine Learning Data Leakage is an ML Evaluation Pitfall and Data Contamination Issue in which evaluation data overlaps with (or carries information from) training data, causing inflated performance metrics.
- AKA: Data Leakage, Train-Test Leakage, Information Leakage, Data Contamination, Evaluation Data Leak.
- Context:
- It can typically occur through Direct Data Duplication when identical samples appear in both training set and test set.
- It can typically manifest via Feature Leakage when target-correlated features encode future information or the target itself.
- It can typically arise from Preprocessing Leakage when data transformations use global statistics computed before data splitting (see the sketch after this group).
- It can typically result from Temporal Leakage when a time-series train-test split violates temporal ordering.
- It can typically emerge through Group Leakage when related observations appear across dataset partitions.
- ...
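Preprocessing Leakage is easy both to reproduce and to fix. The following is a minimal sketch (assuming scikit-learn and a synthetic dataset; all variable names are illustrative, not from any specific source) contrasting a leaky pipeline, which fits a scaler on the full dataset before splitting, with the correct order of operations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# LEAKY: the scaler's mean/std are computed on the full dataset,
# so test-set statistics influence the training features.
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)

# CORRECT: split first, then fit the scaler on the training split only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)                 # train statistics only
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print(model.score(scaler.transform(X_te), y_te))
```

In practice, wrapping the scaler and model in a sklearn.pipeline.Pipeline and fitting it only on training folds avoids this ordering mistake by construction.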
- It can often involve Validation Set Contamination through repeated experiments, information transfer, and adaptive overfitting.
- It can often include Cross-Validation Leakage via improper fold creation, data point dependency, and stratification error.
- It can often contain Synthetic Data Leakage when generated samples retain training data characteristics.
- It can often exhibit Benchmark Contamination when public test sets appear in pretraining data (see the overlap-check sketch after this group).
- It can often display Prompt Leakage in LLM evaluation when example prompts contain test answers.
- ...
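Benchmark Contamination is often screened for with heuristic n-gram overlap between test items and the pretraining corpus. The sketch below is a simplified, hypothetical version of such a check; `corpus` and `test_items` are placeholders, and real checks typically use normalized tokenization over much larger corpora:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_items: list, corpus: str, n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return flagged / max(len(test_items), 1)
```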
- It can range from being Subtle Machine Learning Data Leakage to being Obvious Machine Learning Data Leakage, depending on its leakage detectability.
- It can range from being Minor Machine Learning Data Leakage to being Severe Machine Learning Data Leakage, depending on its leakage performance impact.
- It can range from being Accidental Machine Learning Data Leakage to being Systematic Machine Learning Data Leakage, depending on its leakage occurrence pattern.
- It can range from being Direct Machine Learning Data Leakage to being Indirect Machine Learning Data Leakage, depending on its leakage mechanism.
- It can range from being Partial Machine Learning Data Leakage to being Complete Machine Learning Data Leakage, depending on its leakage data proportion.
- ...
- It can be detected by Leakage Detection Methods through distribution analysis, duplicate checking, and performance anomaly detection (a detection-and-splitting sketch follows this group).
- It can be prevented by Proper Split Protocols using temporal splitting, grouped splitting, and stratified sampling.
- It can be avoided by Pipeline Best Practices via separate preprocessing, cross-validation nesting, and holdout isolation.
- It can be documented in Evaluation Reports through data handling description, split methodology, and contamination check.
- ...
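A minimal sketch of two of these safeguards (assuming scikit-learn; the data and group labels are synthetic placeholders): a hash-based duplicate check for detection, plus grouped and temporal splitters that enforce group independence and time ordering:

```python
import hashlib
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

def duplicate_count(train_rows, test_rows) -> int:
    """Number of test rows whose exact byte content also appears in training."""
    def digest(row):
        return hashlib.sha256(np.asarray(row).tobytes()).hexdigest()
    train_hashes = {digest(r) for r in train_rows}
    return sum(digest(r) in train_hashes for r in test_rows)

X = np.arange(100, dtype=float).reshape(50, 2)
groups = np.repeat(np.arange(10), 5)        # e.g., 5 records per patient

# Grouped splitting: every group lands entirely on one side of each fold.
for tr_idx, te_idx in GroupKFold(n_splits=5).split(X, groups=groups):
    assert not set(groups[tr_idx]) & set(groups[te_idx])

# Temporal splitting: training indices always precede test indices.
for tr_idx, te_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert tr_idx.max() < te_idx.min()
```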
- Example(s):
- Target Leakage Examples, such as:
- Hospital Readmission Leakage when discharge notes contain readmission information.
- Credit Default Leakage when account closure indicates future default.
- Customer Churn Leakage when cancellation fields leak the churn label (sketched after these examples).
- Preprocessing Leakage Examples, such as:
- Normalization Leakage using full dataset mean before train-test split.
- Imputation Leakage filling missing values with test set information.
- Feature Selection Leakage using entire dataset for feature importance.
- Temporal Leakage Examples, such as:
- Stock Prediction Leakage using future prices in technical indicators.
- Time Series Leakage with lookahead bias in rolling window.
- Event Prediction Leakage including post-event data in training.
- Duplication Leakage Examples, such as:
- Image Augmentation Leakage with augmented versions across splits.
- Text Duplicate Leakage from paraphrases in different partitions.
- Synthetic Oversampling Leakage with SMOTE samples in the test set.
- LLM-Specific Leakage Examples, such as:
- Benchmark Contamination when test questions appear in pretraining corpus.
- Few-Shot Leakage when prompt examples contain test answer patterns.
- Memorization Leakage when a model memorizes the evaluation dataset.
- ...
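To make the Customer Churn Leakage example concrete, here is a hypothetical sketch in which a cancellation-derived field deterministically encodes the label, yielding near-perfect but meaningless test accuracy (column names and data are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
churned = rng.integers(0, 2, n)
df = pd.DataFrame({
    "monthly_spend": rng.normal(60, 20, n),
    # Leaky feature: only populated after a customer has churned.
    "has_cancellation_date": churned,
    "churned": churned,
})

X = df[["monthly_spend", "has_cancellation_date"]]
y = df["churned"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))   # ~1.0: the model just reads back the label
```

Dropping any field that is only observable after the outcome restores an honest evaluation.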
- Counter-Example(s):
- Clean Data Split, which maintains strict separation between training data and evaluation data.
- Proper Temporal Split, which respects time ordering and prevents future information leak.
- Valid Cross-Validation, which ensures proper fold independence and no information sharing (see the nested-CV sketch below).
- Isolated Test Set, which remains completely unseen during model development.
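As a sketch of Valid Cross-Validation (assuming scikit-learn and a synthetic dataset), nested cross-validation keeps hyperparameter tuning inside the inner folds, so each outer fold acts as an isolated test set that model selection never touches:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```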
- See: ML Evaluation Pitfall, Data Contamination, Train-Test Split, Cross-Validation, Overfitting, Evaluation Metric, Machine Learning Evaluation, Temporal Validation, Feature Engineering, LLM Evaluation Method, Benchmark Contamination.