Automated Software Engineering Benchmark
An Automated Software Engineering Benchmark is an automation benchmark that evaluates the performance of automated software engineering tools and methodologies on real-world software development tasks.
- Context:
- It can (typically) be designed to test the capabilities of language models, code generation tools, or automated programming assistants in solving or optimizing software engineering problems.
- It can (typically) involve community collaboration for the creation and validation of benchmark tasks and datasets.
- It can (often) include a set of predefined tasks that mimic real-world software engineering challenges, such as bug fixing, code optimization, or feature implementation.
- It can (often) be used to highlight the current limitations and future directions for research in automated software engineering.
- It can utilize metrics such as accuracy, efficiency, scalability, and generalizability to assess performance (a minimal scoring sketch appears after this outline).
- It can provide a standardized way for researchers and developers to compare the effectiveness of different automated software engineering approaches.
- ...
- Example(s):
- SWE-bench, which evaluates language models on resolving real-world GitHub issues drawn from popular Python repositories (Jimenez et al., 2024).
- Counter-Example(s):
- A Manual Software Engineering Test, which evaluates human programmers' abilities.
- General AI Benchmarks, which are not specifically designed for software engineering tasks.
- See: Automated Software Engineering, Benchmark, Language Model, Code Generation, Software Development Task.
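The scoring idea behind such benchmarks can be illustrated with a small sketch. All names below (BenchmarkTask, resolution_rate, the task identifiers) are hypothetical and are not taken from any particular benchmark; the sketch assumes a benchmark reports the fraction of tasks whose candidate solution passes the tests associated with each task.

```python
# Minimal sketch (all names hypothetical): a benchmark task record and a
# simple resolution-rate metric of the kind such benchmarks report.
from dataclasses import dataclass
from typing import List


@dataclass
class BenchmarkTask:
    task_id: str            # e.g. a repository + issue identifier
    repo: str               # repository the task is drawn from
    problem_statement: str  # natural-language description (e.g. an issue report)
    tests: List[str]        # tests that must pass for the task to count as resolved


def resolution_rate(results: dict) -> float:
    """Fraction of tasks whose candidate patch made all associated tests pass."""
    if not results:
        return 0.0
    return sum(1 for resolved in results.values() if resolved) / len(results)


# Usage with hypothetical outcomes for three tasks:
outcomes = {"repo-a__issue-1": True, "repo-b__issue-7": False, "repo-c__issue-3": True}
print(f"Resolution rate: {resolution_rate(outcomes):.2%}")  # -> 66.67%
```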
References
2024
- (Jimenez et al., 2024) ⇒ Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. (2024). “SWE-bench: Can Language Models Resolve Real-world GitHub Issues?.” In: The Twelfth International Conference on Learning Representations.
- QUOTE: SWE-bench is an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.
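A minimal sketch of inspecting these task instances follows. It assumes SWE-bench is distributed on the Hugging Face Hub under "princeton-nlp/SWE-bench" and that the field names used below exist; both are assumptions of this sketch, not statements from the quote above.

```python
# Hedged sketch: loading and inspecting SWE-bench task instances, assuming the
# dataset is hosted on the Hugging Face Hub as "princeton-nlp/SWE-bench" and
# exposes the fields accessed below (assumed, not confirmed by this entry).
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swe_bench))  # expected to be on the order of the 2,294 problems cited above

example = swe_bench[0]
# Assumed fields: the source repository, the GitHub issue text the model must
# resolve, and the reference patch from the corresponding pull request.
print(example["repo"])
print(example["problem_statement"][:200])
print(example["patch"][:200])
```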
2021
- (Zöller & Huber, 2021) ⇒ Marc-André Zöller, and Marco F. Huber. (2021). “Benchmark and Survey of Automated Machine Learning Frameworks.” In: Journal of Artificial Intelligence Research, 70.
- ABSTRACT: Machine learning (ML) has become a vital part in many aspects of our daily life. However, building well performing machine learning applications requires highly specialized data scientists and domain experts. Automated machine learning (AutoML) aims to reduce the demand for data scientists by enabling domain experts to build machine learning applications automatically without extensive knowledge of statistics and machine learning. This paper is a combination of a survey on current AutoML methods and a benchmark of popular AutoML frameworks on real data sets. Driven by the selected frameworks for evaluation, we summarize and review important AutoML techniques and methods concerning every step in building an ML pipeline. The selected AutoML frameworks are evaluated on 137 data sets from established AutoML benchmark suites.