BigLaw Bench
A BigLaw Bench is a proprietary real-world legal AI benchmark developed by Harvey.ai that can support complex legal AI evaluation tasks.
- AKA: BigLaw Bench Benchmark, Harvey BigLaw Bench, Harvey Legal AI Benchmark.
- Context:
- It can typically evaluate Legal AI System performance through legal time entry-based tasks derived from actual billable legal work.
- It can typically measure Legal Answer Quality by assessing lawyer-quality work product completion percentages against custom legal rubrics.
- It can typically assess Legal Source Reliability by measuring legal reference accuracy and legal citation verifiability.
- It can typically benchmark Legal Foundation Models on legal document drafting, legal reasoning, and legal risk assessment.
- It can typically provide Legal Performance Metrics through legal answer scores and legal source scores.
- ...
- It can often evaluate Legal Transactional Tasks including legal due diligence, legal contract analysis, and legal transaction structuring.
- It can often assess Legal Litigation Tasks including legal argument construction, legal document drafting, and legal case analysis.
- It can often focus on Legal Practice Areas such as corporate law, intellectual property law, and contract law.
- It can often identify Legal AI Capability Gaps in legal ideation, legal argumentation, and legal source provision.
- It can often facilitate Legal AI Fine-Tuning by legal AI developers and law firms to improve real-world legal utility.
- ...
- It can range from being a Simple Legal Drafting Benchmark to being a Complex Legal Risk Assessment Benchmark, depending on its legal task complexity.
- It can range from being a Core Legal Task Benchmark to being an Advanced Legal Workflow Benchmark, depending on its legal evaluation scope.
- It can range from being a Single Legal Domain Benchmark to being a Comprehensive Legal Practice Benchmark, depending on its legal domain coverage.
- ...
- It can employ Custom Legal Rubrics evaluating legal task completion, legal professional tone, legal relevance, and legal hallucination detection.
- It can utilize Legal Time Entry Mapping for realistic legal work representation across legal practice areas.
- It can supplement Traditional Legal AI Benchmarks that focus on multiple-choice legal exams rather than practical legal work.
- It can apply Positive-Negative Scoring Systems combining legal achievement points with legal error penalties (see the scoring sketch below).
- It can measure Legal Model Performance Differentials between proprietary legal AI models and public foundation models.
- It can provide Legal Task Taxonomy dividing legal work by legal practice area, legal work type, and legal matter portion.
- It can support Legal Workflow Agent Evaluation through complex legal workflow datasets like SPA deal point extraction.
- It can enable Legal AI Transparency through public legal benchmark results and legal evaluation methodology.
- ...
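The custom rubrics and positive-negative scoring referenced above can be illustrated with a minimal sketch. The rubric fields, point values, and function below are hypothetical illustrations of the published description (positive points for meeting task requirements, deductions for errors such as hallucinations), not Harvey's actual rubric format.

```python
# Minimal sketch of BigLaw Bench-style positive-negative answer scoring.
# Rubric structure, field names, and weights are hypothetical.
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str   # e.g. "Drafts a mutual confidentiality clause covering both parties"
    points: float      # positive points available for meeting this requirement

@dataclass
class PenaltyItem:
    description: str   # e.g. "Hallucinates a non-existent statute"
    points: float      # points deducted for this error

def answer_score(earned: list[RubricItem],
                 penalties: list[PenaltyItem],
                 rubric: list[RubricItem]) -> float:
    """Share of a lawyer-quality work product completed:
    positive points earned minus penalty points, over total points available."""
    total = sum(item.points for item in rubric)
    positive = sum(item.points for item in earned)
    negative = sum(item.points for item in penalties)
    return max(0.0, (positive - negative) / total)
```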
- Example(s):
- BigLaw Bench Core Tasks, such as:
- Legal Transactional Task Categories demonstrating legal analytical capability, such as:
- Legal Confidentiality Clause Drafting Task for merger agreement preparation, evaluated on legal completeness, legal accuracy, and client need appropriateness.
- Legal Due Diligence Task for legal provision extraction from service agreements.
- Legal Transaction Structuring Task for financing option analysis including PIPE transactions and equity offerings.
- Legal Contract Provision Extraction Task for assignment clause, indemnification clause, and termination clause identification.
- Legal Litigation Task Categories demonstrating legal argumentation capability, such as:
- Legal Risk Assessment Task for business action evaluation, requiring relevant law identification with proper legal citations.
- Legal Argument Construction Task for legal position development.
- Legal Document Drafting Task for litigation document preparation.
- Legal Case Analysis Task for legal precedent application.
- BigLaw Bench Retrieval Datasets, such as:
- Legal Contract Retrieval Tasks demonstrating complex legal document navigation, such as:
- Merger Agreement Analysis Task handling hundred-page legal documents with legal cross-reference tracking.
- Stock Purchase Agreement (SPA) Task for legal deal point extraction achieving 98.47% accuracy (Harvey, 2024).
- Legal Discovery Email Tasks demonstrating high-volume legal document processing, such as:
- BigLaw Bench Performance Results, such as:
- Harvey Assistant (2024), achieving 74% legal answer score overall, outperforming public foundation models.
- GPT-4o Model (2024), achieving 61% legal answer score on legal tasks.
- Claude 3 Opus (2024), showing legal source hallucination when prompted for specific document passages.
- Gemini 2.5 Pro Preview (2025), achieving 85.02% legal answer score on BigLaw Bench.
- GPT-5 Model (2025), achieving 89.22% legal answer score overall.
- BigLaw Bench Workflow Agents, such as:
- ...
- Counter-Example(s):
- LegalBench, which uses academic legal reasoning tasks with IRAC framework rather than actual billable legal work.
- CUAD (Contract Understanding Atticus Dataset), which focuses on contract clause classification rather than comprehensive legal task evaluation.
- Rule QA Task, which evaluates legal rule knowledge rather than practical legal work completion.
- Vals Legal AI Report, which uses law firm-designed legal tasks rather than time entry-derived legal tasks.
- See: Legal AI Benchmark, Harvey.ai, Legal AI Evaluation Framework, Legal Time Entry, Lawyer-Quality Work Product, Legal Hallucination, Legal Source Verification.
References
2024
- https://www.harvey.ai/blog/introducing-biglaw-bench
- NOTES:
- BigLaw Bench is a framework for evaluating large language models (LLMs) on complex legal tasks using real-world examples.
- The tasks used in BigLaw Bench are derived from time entries, which represent billable legal work performed by lawyers, covering tasks like risk assessment, drafting documents, and client advisory.
- Existing benchmarks like multiple-choice tests are insufficient to evaluate the complex legal tasks that lawyers perform.
- BigLaw Bench focuses on litigation and transactional tasks, reflecting different practice areas and the nature of legal matters.
- Custom rubrics are used to evaluate the LLMs' performance, considering factors like task completion, tone, relevance, and common failure modes (e.g., hallucinations).
- Answer score measures how much of a lawyer-quality work product the LLM completes, considering both positive achievements and negative errors.
- Source score measures the LLM’s ability to provide accurate references to support its assertions, ensuring traceability and verifiability (see the sketch below).
- Public LLMs often perform well in generating content but struggle with providing accurate sourcing, leading to lower source scores.
- Transactional tasks generally see better model performance, as they are more analytical, whereas litigation tasks require ideation and argumentation, areas where LLMs underperform.
- Foundation models tend to hallucinate sources when explicitly asked to provide references, leading to lower accuracy in sourcing.
- The evaluation methodology includes a combination of positive and negative scoring, which highlights both the strengths and weaknesses of LLMs in real-world legal tasks.
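A hedged sketch of the source score described in these notes: since it measures the model's ability to provide accurate, verifiable references, the sketch below treats it as the share of cited passages that can be located in the underlying documents. The data format and the exact-substring verification rule are simplifying assumptions, not Harvey's published implementation.

```python
# Hypothetical sketch of a source score: the fraction of a model's citations
# that can be verified against the source documents. The exact-substring
# check is a simplification for illustration.

def source_score(citations: list[dict], documents: dict[str, str]) -> float:
    """citations: [{"doc_id": ..., "quoted_text": ...}, ...]
    documents: mapping of doc_id -> full document text."""
    if not citations:
        return 0.0
    verifiable = sum(
        1 for c in citations
        if c["quoted_text"] in documents.get(c["doc_id"], "")
    )
    return verifiable / len(citations)
```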
2024
- https://github.com/harveyai/biglaw-bench
- NOTES:
- BigLaw Bench is a comprehensive framework for evaluating large language models (LLMs) on complex, real-world legal tasks.
- Developed by Harvey's legal research team, it aims to supplement existing benchmarks by focusing on tasks that mirror actual billable work lawyers perform.
- Tasks are organized into two primary categories: Transactional Task Categories and Litigation Task Categories, each with several specific task types.
- The evaluation methodology uses custom-designed rubrics that measure Answer Quality (completeness, accuracy, appropriateness) and Source Reliability (verifiable and correctly cited sources).
- Scores are calculated by combining positive points for meeting task requirements and negative points for errors or missteps, such as hallucinations (a worked example follows these notes).
- The final answer score represents the percentage of a lawyer-quality work product the LLM completes.
- A data sample is available for preview, and full access to the dataset can be obtained by contacting Harvey directly.
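As a worked illustration of the scoring arithmetic (the numbers are invented, not drawn from the benchmark): if a task rubric offers 20 positive points and a model earns 16 of them while incurring 2 penalty points for errors, its answer score would be (16 - 2) / 20 = 70% of a lawyer-quality work product.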