BrowseComp Benchmark
A BrowseComp Benchmark is a web agent benchmark that evaluates autonomous web browsing and navigation capabilities through realistic task completion scenarios.
- AKA: Browse Comp, Web Browsing Competition Benchmark, Browser Agent Benchmark, Web Navigation Benchmark.
- Context:
- It can typically evaluate BrowseComp Navigation Tasks including form filling, information extraction, and multi-page traversal.
- It can typically measure BrowseComp Success Metrics through task completion rates and action efficiency scores.
- It can typically provide BrowseComp Test Environments with real websites and simulated interfaces.
- It can typically assess BrowseComp Agent Robustness against dynamic content and website variations.
- It can often benchmark Multi-Step Web Tasks requiring sequential planning.
- It can often support Language-Specific Evaluations across different locales.
- It can often identify Navigation Strategy Differences between agent implementations.
- It can range from being a Simple BrowseComp Benchmark to being a Complex BrowseComp Benchmark, depending on its task difficulty.
- It can range from being an English BrowseComp Benchmark to being a Multilingual BrowseComp Benchmark, depending on its language support.
- It can range from being a Static BrowseComp Benchmark to being a Dynamic BrowseComp Benchmark, depending on its website update frequency.
- It can range from being a Single-Site BrowseComp Benchmark to being a Multi-Site BrowseComp Benchmark, depending on its domain coverage.
- ...
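The success metrics above (task completion rate and action efficiency) can be sketched as follows. This is a minimal illustration, assuming a hypothetical per-task result record; the field names (`completed`, `actions_taken`, `optimal_actions`) are not taken from any official BrowseComp harness.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one benchmark task attempt (hypothetical schema)."""
    completed: bool       # did the agent satisfy the task goal?
    actions_taken: int    # clicks, form fills, page loads, ...
    optimal_actions: int  # shortest known action sequence for the task

def completion_rate(results):
    """Fraction of tasks the agent completed successfully."""
    return sum(r.completed for r in results) / len(results)

def action_efficiency(results):
    """Mean ratio of optimal to actual action count over completed tasks.

    1.0 means the agent matched the shortest known path; lower values
    indicate wasted actions. Failed tasks are excluded.
    """
    done = [r for r in results if r.completed]
    if not done:
        return 0.0
    return sum(r.optimal_actions / r.actions_taken for r in done) / len(done)

results = [
    TaskResult(completed=True, actions_taken=8, optimal_actions=4),
    TaskResult(completed=True, actions_taken=5, optimal_actions=5),
    TaskResult(completed=False, actions_taken=20, optimal_actions=6),
]
print(completion_rate(results))    # 2 of 3 tasks completed
print(action_efficiency(results))  # (0.5 + 1.0) / 2 = 0.75
```

Separating the two metrics matters: an agent can complete most tasks while taking many redundant actions, which only the efficiency score exposes.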
- Example(s):
- BrowseComp Evaluation Results, such as:
- WebSailor-V2-30B-A3B BrowseComp Score, demonstrating state-of-the-art performance.
- Tongyi DeepResearch Agent BrowseComp Performance, showing navigation capability.
- Baseline Agent BrowseComp Score, establishing a comparison baseline.
- BrowseComp Task Types, such as:
- Shopping Task BrowseComp, requiring product search and checkout.
- Information Gathering BrowseComp, requiring data extraction.
- ...
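The multi-step tasks listed above hinge on sequential planning: reaching a goal page requires chaining several navigation actions. A minimal sketch, assuming a toy site modeled as a link graph (real benchmarks run against live or sandboxed websites, not a dict), is a breadth-first search over pages where each hop is one click action:

```python
from collections import deque

# Hypothetical multi-page site as a link graph: page -> outgoing links.
SITE = {
    "home":     ["products", "about"],
    "products": ["home", "laptop"],
    "about":    ["home"],
    "laptop":   ["products", "checkout"],
    "checkout": [],
}

def plan_navigation(site, start, goal):
    """Breadth-first search for the shortest click path start -> goal.

    Stands in for the sequential planning a browsing agent must do in a
    multi-step task; path length - 1 is the action count a benchmark
    would score against.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in site.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable from start

path = plan_navigation(SITE, "home", "checkout")
print(path)  # ['home', 'products', 'laptop', 'checkout']
```

A single-action benchmark (see the counter-examples below) needs no such search: its tasks terminate after one hop, which is why it omits sequential planning.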
- Counter-Example(s):
- Static Web Scraping Benchmark, which lacks interactive navigation.
- API Testing Benchmark, which bypasses visual interface.
- Single-Action Benchmark, which omits sequential planning.
- See: Web Agent Benchmark, Web Navigation Task, WebSailor-V2-30B-A3B Model, Agent Evaluation Framework, Humanity's Last Exam (HLE) Benchmark, Web Automation System, Browser Automation, Task Completion Metric, Multi-Step Planning.