Bug-Fix Training Dataset
A Bug-Fix Training Dataset is a labeled training dataset containing paired buggy code samples and fixed code samples for training automated program repair systems.
- AKA: Program Repair Dataset, Code Fix Corpus, Bug-Patch Dataset.
- Context:
- It can typically contain Code Change Pairs with before-after versions.
- It can typically include Commit Metadata such as commit messages and bug descriptions.
- It can typically support Model Training for learning-based systems.
- It can often provide Test Cases for repair validation.
- It can often enable Pattern Mining for fix template extraction.
- ...
- It can range from being a Small Bug-Fix Dataset to being a Large-Scale Bug-Fix Dataset, depending on its sample size.
- It can range from being a Single-Language Bug-Fix Dataset to being a Multi-Language Bug-Fix Dataset, depending on its language coverage.
- It can range from being a Curated Bug-Fix Dataset to being an Automatically-Mined Bug-Fix Dataset, depending on its creation method.
- It can range from being a Simple-Bug Dataset to being a Complex-Bug Dataset, depending on its defect sophistication.
- ...
- It can be created through Repository Mining from version control systems.
- It can be processed by Data Preprocessing Tools for quality control.
- It can be evaluated using Benchmark Tasks for dataset effectiveness.
- ...
- Example(s):
- Defects4J Dataset containing Java bug-fix pairs.
- ManyBugs Dataset with C bug-fix pairs.
- GitHub Bug-Fix Dataset mined from open source repositories.
- BugsInPy Dataset focusing on Python bug-fix pairs.
- Bears Dataset with continuous integration failures.
- QuixBugs Dataset containing algorithm bugs across multiple languages.
- BugSwarm Dataset with reproducible build failures.
- ...
- Counter-Example(s):
- Unlabeled Code Dataset, which lacks fix annotations.
- Single-Version Code Dataset, which contains only one code version.
- Test Dataset, which is used for evaluation rather than training.
- See: Labeled Training Dataset, Training Dataset, Software Engineering Dataset, Benchmark Dataset.
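- Illustration: the Code Change Pairs, Commit Metadata, and Repository Mining items above can be sketched as follows. This is a minimal sketch, not the method of any dataset listed on this page. The names BugFixPair, FIX_PATTERN, looks_like_fix, and mine_pairs are hypothetical, the commit tuples are made-up sample data, and the keyword filter is an assumed heuristic that real mining pipelines would refine.

```python
import difflib
import re
from dataclasses import dataclass

# Hypothetical keyword heuristic for spotting bug-fix commits (an assumption,
# not a rule taken from any published dataset).
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|defect|patch)\b", re.IGNORECASE)


@dataclass
class BugFixPair:
    """One labeled sample: the code before and after a fix, plus metadata."""
    buggy_code: str
    fixed_code: str
    commit_message: str
    language: str

    def diff(self) -> str:
        """Return a unified diff from the buggy version to the fixed version."""
        return "".join(difflib.unified_diff(
            self.buggy_code.splitlines(keepends=True),
            self.fixed_code.splitlines(keepends=True),
            fromfile="buggy", tofile="fixed"))


def looks_like_fix(message: str) -> bool:
    """True when the commit message contains a fix-related keyword."""
    return FIX_PATTERN.search(message) is not None


def mine_pairs(commits):
    """Keep commits whose message looks like a fix and whose code changed.

    `commits` is an iterable of (before, after, message, language) tuples,
    standing in for data read from a version control system.
    """
    return [BugFixPair(before, after, message, language)
            for before, after, message, language in commits
            if looks_like_fix(message) and before != after]


# Made-up sample commits for illustration only.
commits = [
    ("def add(a, b):\n    return a - b\n",
     "def add(a, b):\n    return a + b\n",
     "Fix sign error in add", "python"),
    ("x = 1\n", "x = 1\ny = 2\n", "Add y variable", "python"),
]
pairs = mine_pairs(commits)
print(len(pairs))
print(pairs[0].diff())
```

Only the first sample commit is kept, because the second message contains none of the assumed keywords. Keyword filtering of this kind is one reason an Automatically-Mined Bug-Fix Dataset tends to be noisier than a Curated Bug-Fix Dataset.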