Agentic System Golden Set Dataset

From GM-RKB

An Agentic System Golden Set Dataset is an evaluation dataset that contains representative task snapshots and expected trajectories for agentic system regression testing.

AKA: Golden Dataset for Agents, Agent Benchmark Dataset, Regression Test Reference Dataset, Canonical Agent Test Set.
Context:
- It can typically preserve agent interaction trajectories with environmental states and decision sequences for replay testing.
- It can typically include diverse scenario coverage spanning edge cases, typical usage patterns, and failure modes.
- It can typically maintain version-controlled snapshots with expected outputs and acceptable variance thresholds.
- It can often incorporate human-validated responses as ground truth references for quality assessment.
- It can often support incremental dataset growth through production trace sampling with curation processes.
- It can often enable performance baseline establishment for regression detection and improvement measurement.
- It can range from being a Small Golden Set to being a Comprehensive Golden Set, depending on its coverage scope.
- It can range from being a Static Golden Set to being an Evolving Golden Set, depending on its update frequency.
- It can range from being a Single-Domain Golden Set to being a Multi-Domain Golden Set, depending on its application breadth.
- It can range from being a Synthetic Golden Set to being a Production-Derived Golden Set, depending on its data source.
- ...
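A golden set entry and its regression check can be sketched as follows. This is a minimal illustration, not a prescribed schema: the `GoldenCase` fields, the `min_step_match` threshold, the position-wise matching rule, and the action labels are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One curated entry: a task snapshot plus its expected trajectory."""
    case_id: str
    task_input: str
    expected_trajectory: tuple[str, ...]  # ordered agent actions (hypothetical labels)
    expected_output: str
    min_step_match: float = 1.0  # acceptable variance: fraction of steps that must match


def step_match_ratio(expected, actual):
    """Fraction of positions where the actual step equals the expected step."""
    if not expected and not actual:
        return 1.0
    matches = sum(e == a for e, a in zip(expected, actual))
    # Dividing by the longer length penalizes missing and extra steps alike.
    return matches / max(len(expected), len(actual))


def check_regression(case, actual_trajectory, actual_output):
    """Return True if a replayed run stays within the case's variance threshold."""
    ratio = step_match_ratio(case.expected_trajectory, tuple(actual_trajectory))
    return ratio >= case.min_step_match and actual_output == case.expected_output


# Hypothetical customer-service case, allowing one of three steps to differ.
case = GoldenCase(
    case_id="refund-001",
    task_input="Customer asks for a refund on an order",
    expected_trajectory=("lookup_order", "check_policy", "issue_refund"),
    expected_output="refund_issued",
    min_step_match=0.66,
)
```

Calling `check_regression(case, ["lookup_order", "verify_identity", "issue_refund"], "refund_issued")` passes under this threshold, while a run that skips `check_policy` entirely fails. Position-wise matching is only one possible comparison; an edit-distance or set-based measure may suit agents whose step order legitimately varies.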
Examples:
- LLM Agent Golden Datasets, such as:
  - ChatGPT Interaction Golden Set with conversation histories and expected completions.
  - Code Generation Golden Dataset containing programming tasks with verified solutions.
- RAG System Golden Datasets, such as:
  - Question-Answer Pair Golden Set with relevant documents and expected retrievals.
  - Multi-Hop Reasoning Golden Dataset for complex query validation.
- Task-Specific Golden Datasets, such as:
  - Customer Service Agent Golden Set with support tickets and resolution paths.
- ...
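For the RAG-style examples above, one way to compare a run against the expected retrievals is a recall score over document identifiers. The sketch below assumes a hypothetical entry layout (`question`, `expected_doc_ids`, `expected_answer`) and made-up identifiers; it is not a standard format.

```python
def retrieval_recall(expected_doc_ids, retrieved_doc_ids):
    """Share of expected documents that appear among the retrieved ones."""
    expected = set(expected_doc_ids)
    if not expected:
        return 1.0  # nothing was expected, so nothing can be missed
    return len(expected & set(retrieved_doc_ids)) / len(expected)


# Hypothetical question-answer golden entry with its expected retrievals.
golden_qa = {
    "question": "What is the warranty period?",
    "expected_doc_ids": ["doc-12", "doc-40"],
    "expected_answer": "24 months",
}

full = retrieval_recall(golden_qa["expected_doc_ids"], ["doc-40", "doc-7", "doc-12"])
partial = retrieval_recall(golden_qa["expected_doc_ids"], ["doc-40", "doc-7"])
```

Here `full` evaluates to 1.0 and `partial` to 0.5. A regression test could require recall to stay at or above a per-entry threshold stored with the golden set.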
Counter-Examples:
- Random Test Dataset, which lacks curation and quality validation.
- Training Dataset, which serves model development rather than regression testing.
- Synthetic Benchmark, which may not reflect real-world scenarios.
See: Golden Dataset, Regression Testing, Test Dataset Management, Agentic System Regression Testing Task, Benchmark Dataset, Evaluation Dataset, Test Data Curation.

Retrieved from "http://www.gabormelli.com/RKB/index.php?title=Agentic_System_Golden_Set_Dataset&oldid=969633"