Automated Software Engineering Benchmark
An Automated Software Engineering Benchmark is an automation benchmark that evaluates the performance of automated software engineering tools and methodologies on real-world software development tasks.
- Context:
- It can (typically) be designed to test the capabilities of language models, code generation tools, or automated programming assistants in solving or optimizing software engineering problems.
- It can (typically) involve community collaboration for the creation and validation of benchmark tasks and datasets.
- It can (often) include a set of predefined tasks that mimic real-world software engineering challenges, such as bug fixing, code optimization, or feature implementation.
- It can (often) be used to highlight the current limitations and future directions for research in automated software engineering.
- It can utilize metrics such as accuracy, efficiency, scalability, and generalizability to assess performance (a minimal scoring sketch appears after this outline).
- It can provide a standardized way for researchers and developers to compare the effectiveness of different automated software engineering approaches.
- ...
- Example(s):
- SWE-bench, which evaluates language models on resolving real-world GitHub issues drawn from popular Python repositories (Jimenez et al., 2024).
- Counter-Example(s):
- A Manual Software Engineering Test, which evaluates human programmers' abilities.
- General AI Benchmarks, which are not specifically designed for software engineering tasks.
- See: Automated Software Engineering, Benchmark, Language Model, Code Generation, Software Development Task.
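The scoring idea behind such benchmarks can be illustrated with a small sketch. All names below (BenchmarkTask, resolution_rate, the task identifiers) are hypothetical and are not taken from any particular benchmark; the sketch assumes a benchmark reports the fraction of tasks whose candidate solution passes the tests associated with each task.

```python
# Minimal sketch (all names hypothetical): a benchmark task record and a
# simple resolution-rate metric of the kind such benchmarks report.
from dataclasses import dataclass
from typing import List


@dataclass
class BenchmarkTask:
    task_id: str            # e.g. a repository + issue identifier
    repo: str               # repository the task is drawn from
    problem_statement: str  # natural-language description (e.g. an issue report)
    tests: List[str]        # tests that must pass for the task to count as resolved


def resolution_rate(results: dict) -> float:
    """Fraction of tasks whose candidate patch made all associated tests pass."""
    if not results:
        return 0.0
    return sum(1 for resolved in results.values() if resolved) / len(results)


# Usage with hypothetical outcomes for three tasks:
outcomes = {"repo-a__issue-1": True, "repo-b__issue-7": False, "repo-c__issue-3": True}
print(f"Resolution rate: {resolution_rate(outcomes):.2%}")  # -> 66.67%
```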
References
2024
- (Jimenez et al., 2024) ⇒ Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. (2024). “SWE-bench: Can Language Models Resolve Real-world GitHub Issues?.” In: The Twelfth International Conference on Learning Representations.
- QUOTE: SWE-bench is an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.
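A minimal sketch of inspecting these task instances follows. It assumes SWE-bench is distributed on the Hugging Face Hub under "princeton-nlp/SWE-bench" and that the field names used below exist; both are assumptions of this sketch, not statements from the quote above.

```python
# Hedged sketch: loading and inspecting SWE-bench task instances, assuming the
# dataset is hosted on the Hugging Face Hub as "princeton-nlp/SWE-bench" and
# exposes the fields accessed below (assumed, not confirmed by this entry).
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swe_bench))  # expected to be on the order of the 2,294 problems cited above

example = swe_bench[0]
# Assumed fields: the source repository, the GitHub issue text the model must
# resolve, and the reference patch from the corresponding pull request.
print(example["repo"])
print(example["problem_statement"][:200])
print(example["patch"][:200])
```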
2021
- (Zöller & Huber, 2021) ⇒ Marc-André Zöller, and Marco F. Huber. (2021). “Benchmark and Survey of Automated Machine Learning Frameworks.” In: Journal of Artificial Intelligence Research, 70.
- ABSTRACT: Machine learning (ML) has become a vital part in many aspects of our daily life. However, building well performing machine learning applications requires highly specialized data scientists and domain experts. Automated machine learning (AutoML) aims to reduce the demand for data scientists by enabling domain experts to build machine learning applications automatically without extensive knowledge of statistics and machine learning. This paper is a combination of a survey on current AutoML methods and a benchmark of popular AutoML frameworks on real data sets. Driven by the selected frameworks for evaluation, we summarize and review important AutoML techniques and methods concerning every step in building an ML pipeline. The selected AutoML frameworks are evaluated on 137 data sets from established AutoML benchmark suites.