Machine Learning Data Leakage
A Machine Learning Data Leakage is an ML Evaluation Pitfall and Data Contamination Issue in which evaluation data overlaps with (or carries information from) training data, causing inflated performance metrics.
- AKA: Data Leakage, Train-Test Leakage, Information Leakage, Data Contamination, Evaluation Data Leak.
- Context:
- It can typically occur through Direct Data Duplication when identical samples appear in both training set and test set.
- It can typically manifest via Feature Leakage when target-correlated features encode future information or the target itself.
- It can typically arise from Preprocessing Leakage when data transformations use global statistics computed before data splitting (see the sketch after this group).
- It can typically result from Temporal Leakage when a time-series train-test split violates temporal ordering.
- It can typically emerge through Group Leakage when related observations appear across dataset partitions.
- ...
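Preprocessing Leakage is easy both to reproduce and to fix. The following is a minimal sketch (assuming scikit-learn and a synthetic dataset; all variable names are illustrative, not from any specific source) contrasting a leaky pipeline, which fits a scaler on the full dataset before splitting, with the correct order of operations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# LEAKY: the scaler's mean/std are computed on the full dataset,
# so test-set statistics influence the training features.
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)

# CORRECT: split first, then fit the scaler on the training split only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)                 # train statistics only
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print(model.score(scaler.transform(X_te), y_te))
```

In practice, wrapping the scaler and model in a sklearn.pipeline.Pipeline and fitting it only on training folds avoids this ordering mistake by construction.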
- It can often involve Validation Set Contamination through repeated experiments, information transfer, and adaptive overfitting.
- It can often include Cross-Validation Leakage via improper fold creation, data point dependency, and stratification error.
- It can often contain Synthetic Data Leakage when generated samples retain training data characteristics.
- It can often exhibit Benchmark Contamination when public test sets appear in pretraining data (see the overlap-check sketch after this group).
- It can often display Prompt Leakage in LLM evaluation when example prompts contain test answers.
- ...
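Benchmark Contamination is often screened for with heuristic n-gram overlap between test items and the pretraining corpus. The sketch below is a simplified, hypothetical version of such a check; `corpus` and `test_items` are placeholders, and real checks typically use normalized tokenization over much larger corpora:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_items: list, corpus: str, n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return flagged / max(len(test_items), 1)
```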
- It can range from being Subtle Machine Learning Data Leakage to being Obvious Machine Learning Data Leakage, depending on its leakage detectability.
- It can range from being Minor Machine Learning Data Leakage to being Severe Machine Learning Data Leakage, depending on its leakage performance impact.
- It can range from being Accidental Machine Learning Data Leakage to being Systematic Machine Learning Data Leakage, depending on its leakage occurrence pattern.
- It can range from being Direct Machine Learning Data Leakage to being Indirect Machine Learning Data Leakage, depending on its leakage mechanism.
- It can range from being Partial Machine Learning Data Leakage to being Complete Machine Learning Data Leakage, depending on its leakage data proportion.
- ...
- It can be detected by Leakage Detection Methods through distribution analysis, duplicate checking, and performance anomaly detection (a detection-and-splitting sketch follows this group).
- It can be prevented by Proper Split Protocols using temporal splitting, grouped splitting, and stratified sampling.
- It can be avoided by Pipeline Best Practices via separate preprocessing, cross-validation nesting, and holdout isolation.
- It can be documented in Evaluation Reports through data handling description, split methodology, and contamination check.
- ...
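A minimal sketch of two of these safeguards (assuming scikit-learn; the data and group labels are synthetic placeholders): a hash-based duplicate check for detection, plus grouped and temporal splitters that enforce group independence and time ordering:

```python
import hashlib
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

def duplicate_count(train_rows, test_rows) -> int:
    """Number of test rows whose exact byte content also appears in training."""
    def digest(row):
        return hashlib.sha256(np.asarray(row).tobytes()).hexdigest()
    train_hashes = {digest(r) for r in train_rows}
    return sum(digest(r) in train_hashes for r in test_rows)

X = np.arange(100, dtype=float).reshape(50, 2)
groups = np.repeat(np.arange(10), 5)        # e.g., 5 records per patient

# Grouped splitting: every group lands entirely on one side of each fold.
for tr_idx, te_idx in GroupKFold(n_splits=5).split(X, groups=groups):
    assert not set(groups[tr_idx]) & set(groups[te_idx])

# Temporal splitting: training indices always precede test indices.
for tr_idx, te_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert tr_idx.max() < te_idx.min()
```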
- Example(s):
- Target Leakage Examples, such as:
- Hospital Readmission Leakage when discharge notes contain readmission information.
- Credit Default Leakage when account closure indicates future default.
- Customer Churn Leakage when cancellation fields leak the churn label (sketched after these examples).
- Preprocessing Leakage Examples, such as:
- Normalization Leakage using full dataset mean before train-test split.
- Imputation Leakage filling missing values with test set information.
- Feature Selection Leakage using entire dataset for feature importance.
- Temporal Leakage Examples, such as:
- Stock Prediction Leakage using future prices in technical indicators.
- Time Series Leakage with lookahead bias in rolling window.
- Event Prediction Leakage including post-event data in training.
- Duplication Leakage Examples, such as:
- Image Augmentation Leakage with augmented versions across splits.
- Text Duplicate Leakage from paraphrases in different partitions.
- Synthetic Oversampling Leakage with SMOTE samples in the test set.
- LLM-Specific Leakage Examples, such as:
- Benchmark Contamination when test questions appear in pretraining corpus.
- Few-Shot Leakage when prompt examples contain test answer patterns.
- Memorization Leakage when a model memorizes the evaluation dataset.
- ...
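To make the Customer Churn Leakage example concrete, here is a hypothetical sketch in which a cancellation-derived field deterministically encodes the label, yielding near-perfect but meaningless test accuracy (column names and data are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
churned = rng.integers(0, 2, n)
df = pd.DataFrame({
    "monthly_spend": rng.normal(60, 20, n),
    # Leaky feature: only populated after a customer has churned.
    "has_cancellation_date": churned,
    "churned": churned,
})

X = df[["monthly_spend", "has_cancellation_date"]]
y = df["churned"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))   # ~1.0: the model just reads back the label
```

Dropping any field that is only observable after the outcome restores an honest evaluation.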
- Counter-Example(s):
- Clean Data Split, which maintains strict separation between training data and evaluation data.
- Proper Temporal Split, which respects time ordering and prevents future information leak.
- Valid Cross-Validation, which ensures proper fold independence and no information sharing (see the nested-CV sketch below).
- Isolated Test Set, which remains completely unseen during model development.
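As a sketch of Valid Cross-Validation (assuming scikit-learn and a synthetic dataset), nested cross-validation keeps hyperparameter tuning inside the inner folds, so each outer fold acts as an isolated test set that model selection never touches:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```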
- See: ML Evaluation Pitfall, Data Contamination, Train-Test Split, Cross-Validation, Overfitting, Evaluation Metric, Machine Learning Evaluation, Temporal Validation, Feature Engineering, LLM Evaluation Method, Benchmark Contamination.