BrowseComp Benchmark
A BrowseComp Benchmark is a web agent benchmark that evaluates autonomous web browsing and navigation capabilities through realistic task completion scenarios.
- AKA: Browse Comp, Web Browsing Competition Benchmark, Browser Agent Benchmark, Web Navigation Benchmark.
- Context:
- It can typically evaluate BrowseComp Navigation Tasks including form filling, information extraction, and multi-page traversal.
- It can typically measure BrowseComp Success Metrics through task completion rates and action efficiency scores.
- It can typically provide BrowseComp Test Environments with real websites and simulated interfaces.
- It can typically assess BrowseComp Agent Robustness against dynamic content and website variations.
- It can often benchmark Multi-Step Web Tasks requiring sequential planning.
- It can often support Language-Specific Evaluations across different locales.
- It can often identify Navigation Strategy Differences between agent implementations.
- It can range from being a Simple BrowseComp Benchmark to being a Complex BrowseComp Benchmark, depending on its task difficulty.
- It can range from being an English BrowseComp Benchmark to being a Multilingual BrowseComp Benchmark, depending on its language support.
- It can range from being a Static BrowseComp Benchmark to being a Dynamic BrowseComp Benchmark, depending on its website update frequency.
- It can range from being a Single-Site BrowseComp Benchmark to being a Multi-Site BrowseComp Benchmark, depending on its domain coverage.
- ...
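The success metrics above (task completion rate and action efficiency) can be sketched as follows. This is a minimal illustration, assuming a hypothetical per-task result record; the field names (`completed`, `actions_taken`, `optimal_actions`) are not taken from any official BrowseComp harness.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one benchmark task attempt (hypothetical schema)."""
    completed: bool       # did the agent satisfy the task goal?
    actions_taken: int    # clicks, form fills, page loads, ...
    optimal_actions: int  # shortest known action sequence for the task

def completion_rate(results):
    """Fraction of tasks the agent completed successfully."""
    return sum(r.completed for r in results) / len(results)

def action_efficiency(results):
    """Mean ratio of optimal to actual action count over completed tasks.

    1.0 means the agent matched the shortest known path; lower values
    indicate wasted actions. Failed tasks are excluded.
    """
    done = [r for r in results if r.completed]
    if not done:
        return 0.0
    return sum(r.optimal_actions / r.actions_taken for r in done) / len(done)

results = [
    TaskResult(completed=True, actions_taken=8, optimal_actions=4),
    TaskResult(completed=True, actions_taken=5, optimal_actions=5),
    TaskResult(completed=False, actions_taken=20, optimal_actions=6),
]
print(completion_rate(results))    # 2 of 3 tasks completed
print(action_efficiency(results))  # (0.5 + 1.0) / 2 = 0.75
```

Separating the two metrics matters: an agent can complete most tasks while taking many redundant actions, which only the efficiency score exposes.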
- Example(s):
- BrowseComp Evaluation Results, such as:
- WebSailor-V2-30B-A3B BrowseComp Score, demonstrating state-of-the-art performance.
- Tongyi DeepResearch Agent BrowseComp Performance, showing navigation capability.
- Baseline Agent BrowseComp Score, establishing a comparison baseline.
- BrowseComp Task Types, such as:
- Shopping Task BrowseComp, requiring product search and checkout.
- Information Gathering BrowseComp, requiring data extraction.
- ...
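The multi-step tasks listed above hinge on sequential planning: reaching a goal page requires chaining several navigation actions. A minimal sketch, assuming a toy site modeled as a link graph (real benchmarks run against live or sandboxed websites, not a dict), is a breadth-first search over pages where each hop is one click action:

```python
from collections import deque

# Hypothetical multi-page site as a link graph: page -> outgoing links.
SITE = {
    "home":     ["products", "about"],
    "products": ["home", "laptop"],
    "about":    ["home"],
    "laptop":   ["products", "checkout"],
    "checkout": [],
}

def plan_navigation(site, start, goal):
    """Breadth-first search for the shortest click path start -> goal.

    Stands in for the sequential planning a browsing agent must do in a
    multi-step task; path length - 1 is the action count a benchmark
    would score against.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in site.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable from start

path = plan_navigation(SITE, "home", "checkout")
print(path)  # ['home', 'products', 'laptop', 'checkout']
```

A single-action benchmark (see the counter-examples below) needs no such search: its tasks terminate after one hop, which is why it omits sequential planning.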
- Counter-Example(s):
- Static Web Scraping Benchmark, which lacks interactive navigation.
- API Testing Benchmark, which bypasses visual interface.
- Single-Action Benchmark, which omits sequential planning.
- See: Web Agent Benchmark, Web Navigation Task, WebSailor-V2-30B-A3B Model, Agent Evaluation Framework, Humanity's Last Exam (HLE) Benchmark, Web Automation System, Browser Automation, Task Completion Metric, Multi-Step Planning.