SWE-Bench Benchmark

A SWE-Bench Benchmark is a automated software engineering benchmark of real-world GitHub issues.

Context:
- It can (typically) involve providing a language model with a codebase and an issue, where the model must generate a patch that resolves the described problem.
- It can (often) include a set of Issue-Pull Request Pairs from Python repositories.
- Itcan perform evaluations by unit test verification using post-PR behavior as the reference solution.
- It can (often) challenge models to generate revisions across multiple locations in a large codebase without providing explicit guidance (such as limiting the scope to a function or class or providing fill-in-the-blank style prompts).
- It can (often) serve as a level playing field to compare approaches ranging from retrieval and long-context models to decision-making agents, which could reason and act in code.
- It can be updated regularly with new tasks sourced from active GitHub repositories, making it a living benchmark that evolves alongside advancements in software engineering and AI-driven code generation.
- ...
Example(s):
- an SWE-Bench Lite example from March 2024, where the benchmark was reduced to 300 instances for efficient evaluation across models with lower computational resources.
- June 2024, where models were evaluated on human-annotated tasks, ensuring a known correct solution for all instances.
- An evaluation in 2023, where GPT-4 and Claude 2 were able to resolve only 1.7% and 4.8% of the tasks.
- ... with 2,294 issue-pull request pairs from 12 repositories.
- ...
Counter-Example(s):
- MLE-bench.
- Human Coding Skill Tests.
- Code Editing Benchmarks that limit the scope to a single function or class.
- Cloze-style Fill-in-the-Blank Prompt Benchmarks (for code generation).
See: Language Model, GitHub Issue, Patch Generation, Unit Test Verification, Python Repository.

References

2024-08-20

https://openai.com/index/gpt-4o-fine-tuning/
- NOTES
  - Performance Boost with Fine-Tuned Models: The fine-tuning capability of GPT-4o has been instrumental in significantly improving model performance on SWE-Bench. Cosine's Genie achieving a quantitative lift in performance attributed to fine-tuning GPT-4o on the SWE-Bench benchmark is substantial. Specifically, the fine-tuned version of GPT-4o, as implemented by Cosine's Genie, achieved a 43.8% resolution rate on the SWE-Bench Verified benchmark. This significantly improved from its previous state-of-the-art (SOTA) score of 19.27% on SWE-Bench Full. The lift in performance due to fine-tuning is, therefore, a remarkable 24.53 percentage points on the full benchmark.
  - Customization for Domain-Specific Tasks: OpenAI's fine-tuning feature allows developers to train GPT-4o models on domain-specific data, such as software engineering scenarios covered in SWE-Bench. This customization enables the model to better understand and generate code that resolves intricate issues across diverse Python repositories, improving accuracy and reducing the need for manual intervention (Swebench).
  - SWE-Bench as a Benchmark for Fine-Tuning Efficacy: SWE-Bench serves as a rigorous testbed for evaluating the efficacy of fine-tuning strategies on large language models. The recent success of fine-tuned GPT-4o models on SWE-Bench indicates that such benchmarks are crucial for pushing the boundaries of AI capabilities in real-world applications, particularly in fields like software development (Swebench) (Swebench).

2024

https://www.swebench.com/
- QUOTE: You can download the SWE-bench task instances from HuggingFace or directly as a JSON file (development, test sets). For your convenience, to fine tune your own model for evaluation on SWE-bench, we provide five pre-processed datasets at different retrieval settings ("Oracle", 13K, 27K, 40K, 50K "Llama"). We recommend using the 13K, 27K, or 40K datasets for evaluation. The 50K "Llama" dataset is provided for reproducing the results of the SWE-bench paper.
- We also provide the full SWE-Llama model weights at 13b and 7b parameters, along with their PEFT LoRA weights.
- About: SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.

2023

(Jimenez et al., 2023) ⇒ Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. (2023). “Swe-bench: Can Language Models Resolve Real-world Github Issues?. ” arXiv preprint arXiv:2310.06770
- ABSTRACT: Language models have outpaced our ability to evaluate them effectively, but for their future development, it is essential to study the frontier of their capabilities. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts, and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
- NOTES:
  - SWE-bench Introduction: Introduced by Carlos E. Jimenez and colleagues, SWE-bench is an evaluation framework designed to assess the capability of language models in solving real-world software engineering problems, drawing on issues and solutions from GitHub repositories related to Python programming .
  - Real-World Application Focus: Unlike many existing benchmarks that are either too theoretical or confined to synthetic datasets, SWE-bench is based on 2,294 real software engineering problems from 12 popular Python repositories, making it uniquely positioned to test language models against genuine coding challenges .
  - Task Complexity and Scope: The tasks in SWE-bench range widely in complexity, requiring models to understand and modify codebases, interact with execution environments, and perform complex reasoning across multiple functions, classes, and files, challenging the limits of current code generation and modification capabilities .
  - Comprehensive Evaluation Metrics: The framework evaluates models not just on their ability to generate code but on their effectiveness in producing viable solutions that compile and pass unit tests, offering a more holistic measure of a model's practical utility in software development .
  - Comparison to Related Benchmarks: SWE-bench sets itself apart from benchmarks like HumanEval, which focuses on generating code from natural language descriptions within a single function, by its emphasis on real-world GitHub issues that often require understanding a larger codebase and making multi-file changes .
  - SWE-Llama Fine-tuning: To test the benchmark, the authors introduced SWE-Llama, a model fine-tuned specifically on SWE-bench-like tasks. This approach allows for direct comparison against state-of-the-art models like Claude 2 and GPT-4, providing insights into the effectiveness of fine-tuning on specialized tasks .
  - Performance Insights: Initial results indicate that even advanced models struggle with the benchmark, solving a small percentage of the total problems. This highlights the gap between current language model capabilities and the demands of real-world software engineering, suggesting significant room for improvement .
  - Reproducibility and Open Science Commitment: The authors have made the SWE-bench dataset, training details, and model outcomes available to the research community, encouraging further experimentation, verification, and advancement in the field of AI-driven code generation and modification .
  - Potential for Continuous Learning and Update: Given its foundation in active GitHub repositories, SWE-bench can be continually updated with new tasks, making it a living benchmark that evolves alongside both the software engineering domain and advancements in language model technology .

SWE-Bench Benchmark

References

2024-08-20

2024

2023

Navigation menu

Search