BigLaw Bench
A BigLaw Bench is a proprietary real-world legal AI benchmark developed by Harvey.ai that can support complex legal AI evaluation tasks.
- AKA: BigLaw Bench Benchmark, Harvey BigLaw Bench, Harvey Legal AI Benchmark.
- Context:
- It can typically evaluate Legal AI System performance through legal time entry-based tasks derived from actual billable legal work.
- It can typically measure Legal Answer Quality by assessing lawyer-quality work product completion percentages against custom legal rubrics.
- It can typically assess Legal Source Reliability by measuring legal reference accuracy and legal citation verifiability.
- It can typically benchmark Legal Foundation Models on legal document drafting, legal reasoning, and legal risk assessment.
- It can typically provide Legal Performance Metrics through legal answer scores and legal source scores.
- ...
- It can often evaluate Legal Transactional Tasks including legal due diligence, legal contract analysis, and legal transaction structuring.
- It can often assess Legal Litigation Tasks including legal argument construction, legal document drafting, and legal case analysis.
- It can often focus on Legal Practice Areas such as corporate law, intellectual property law, and contract law.
- It can often identify Legal AI Capability Gaps in legal ideation, legal argumentation, and legal source provision.
- It can often facilitate Legal AI Fine-Tuning by legal AI developers and law firms to improve real-world legal utility.
- ...
- It can range from being a Simple Legal Drafting Benchmark to being a Complex Legal Risk Assessment Benchmark, depending on its legal task complexity.
- It can range from being a Core Legal Task Benchmark to being an Advanced Legal Workflow Benchmark, depending on its legal evaluation scope.
- It can range from being a Single Legal Domain Benchmark to being a Comprehensive Legal Practice Benchmark, depending on its legal domain coverage.
- ...
- It can employ Custom Legal Rubrics evaluating legal task completion, legal professional tone, legal relevance, and legal hallucination detection.
- It can utilize Legal Time Entry Mapping for realistic legal work representation across legal practice areas.
- It can supplement Traditional Legal AI Benchmarks that focus on multiple-choice legal exams rather than practical legal work.
- It can apply Positive-Negative Scoring Systems combining legal achievement points with legal error penalties (see the scoring sketch below).
- It can measure Legal Model Performance Differentials between proprietary legal AI models and public foundation models.
- It can provide Legal Task Taxonomy dividing legal work by legal practice area, legal work type, and legal matter portion.
- It can support Legal Workflow Agent Evaluation through complex legal workflow datasets like SPA deal point extraction.
- It can enable Legal AI Transparency through public legal benchmark results and legal evaluation methodology.
- ...
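The custom rubrics and positive-negative scoring referenced above can be illustrated with a minimal sketch. The rubric fields, point values, and function below are hypothetical illustrations of the published description (positive points for meeting task requirements, deductions for errors such as hallucinations), not Harvey's actual rubric format.

```python
# Minimal sketch of BigLaw Bench-style positive-negative answer scoring.
# Rubric structure, field names, and weights are hypothetical.
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str   # e.g. "Drafts a mutual confidentiality clause covering both parties"
    points: float      # positive points available for meeting this requirement

@dataclass
class PenaltyItem:
    description: str   # e.g. "Hallucinates a non-existent statute"
    points: float      # points deducted for this error

def answer_score(earned: list[RubricItem],
                 penalties: list[PenaltyItem],
                 rubric: list[RubricItem]) -> float:
    """Share of a lawyer-quality work product completed:
    positive points earned minus penalty points, over total points available."""
    total = sum(item.points for item in rubric)
    positive = sum(item.points for item in earned)
    negative = sum(item.points for item in penalties)
    return max(0.0, (positive - negative) / total)
```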
- Example(s):
- BigLaw Bench Core Tasks, such as:
- Legal Transactional Task Categories demonstrating legal analytical capability, such as:
- Legal Confidentiality Clause Drafting Task for merger agreement preparation, evaluated on legal completeness, legal accuracy, and client need appropriateness.
- Legal Due Diligence Task for legal provision extraction from service agreements.
- Legal Transaction Structuring Task for financing option analysis including PIPE transactions and equity offerings.
- Legal Contract Provision Extraction Task for assignment clause, indemnification clause, and termination clause identification.
- Legal Litigation Task Categories demonstrating legal argumentation capability, such as:
- Legal Risk Assessment Task for business action evaluation, requiring relevant law identification with proper legal citations.
- Legal Argument Construction Task for legal position development.
- Legal Document Drafting Task for litigation document preparation.
- Legal Case Analysis Task for legal precedent application.
- BigLaw Bench Retrieval Datasets, such as:
- Legal Contract Retrieval Tasks demonstrating complex legal document navigation, such as:
- Merger Agreement Analysis Task handling hundred-page legal documents with legal cross-reference tracking.
- Stock Purchase Agreement (SPA) Task for legal deal point extraction achieving 98.47% accuracy (Harvey, 2024).
- Legal Discovery Email Tasks demonstrating high-volume legal document processing, such as:
- BigLaw Bench Performance Results, such as:
- Harvey Assistant (2024), achieving 74% legal answer score overall, outperforming public foundation models.
- GPT-4o Model (2024), achieving 61% legal answer score on legal tasks.
- Claude 3 Opus (2024), showing legal source hallucination when prompted for specific document passages.
- Gemini 2.5 Pro Preview (2025), achieving 85.02% legal answer score on BigLaw Bench.
- GPT-5 Model (2025), achieving 89.22% legal answer score overall.
- BigLaw Bench Workflow Agents, such as:
- ...
- Counter-Example(s):
- LegalBench, which uses academic legal reasoning tasks with IRAC framework rather than actual billable legal work.
- CUAD (Contract Understanding Atticus Dataset), which focuses on contract clause classification rather than comprehensive legal task evaluation.
- Rule QA Task, which evaluates legal rule knowledge rather than practical legal work completion.
- Vals Legal AI Report, which uses law firm-designed legal tasks rather than time entry-derived legal tasks.
- See: Legal AI Benchmark, Harvey.ai, Legal AI Evaluation Framework, Legal Time Entry, Lawyer-Quality Work Product, Legal Hallucination, Legal Source Verification.
References
2024
- https://www.harvey.ai/blog/introducing-biglaw-bench
- NOTES:
- BigLaw Bench is a framework for evaluating large language models (LLMs) on complex legal tasks using real-world examples.
- The tasks used in BigLaw Bench are derived from time entries, which represent billable legal work performed by lawyers, covering tasks like risk assessment, drafting documents, and client advisory.
- Existing benchmarks like multiple-choice tests are insufficient to evaluate the complex legal tasks that lawyers perform.
- BigLaw Bench focuses on litigation and transactional tasks, reflecting different practice areas and the nature of legal matters.
- Custom rubrics are used to evaluate the LLMs' performance, considering factors like task completion, tone, relevance, and common failure modes (e.g., hallucinations).
- Answer score measures how much of a lawyer-quality work product the LLM completes, considering both positive achievements and negative errors.
- Source score measures the LLM’s ability to provide accurate references to support its assertions, ensuring traceability and verifiability (see the sketch below).
- Public LLMs often perform well in generating content but struggle with providing accurate sourcing, leading to lower source scores.
- Transactional tasks generally see better model performance, as they are more analytical, whereas litigation tasks require ideation and argumentation, areas where LLMs underperform.
- Foundation models tend to hallucinate sources when explicitly asked to provide references, leading to lower accuracy in sourcing.
- The evaluation methodology includes a combination of positive and negative scoring, which highlights both the strengths and weaknesses of LLMs in real-world legal tasks.
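A hedged sketch of the source score described in these notes: since it measures the model's ability to provide accurate, verifiable references, the sketch below treats it as the share of cited passages that can be located in the underlying documents. The data format and the exact-substring verification rule are simplifying assumptions, not Harvey's published implementation.

```python
# Hypothetical sketch of a source score: the fraction of a model's citations
# that can be verified against the source documents. The exact-substring
# check is a simplification for illustration.

def source_score(citations: list[dict], documents: dict[str, str]) -> float:
    """citations: [{"doc_id": ..., "quoted_text": ...}, ...]
    documents: mapping of doc_id -> full document text."""
    if not citations:
        return 0.0
    verifiable = sum(
        1 for c in citations
        if c["quoted_text"] in documents.get(c["doc_id"], "")
    )
    return verifiable / len(citations)
```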
2024
- https://github.com/harveyai/biglaw-bench
- NOTES:
- BigLaw Bench is a comprehensive framework for evaluating large language models (LLMs) on complex, real-world legal tasks.
- Developed by Harvey's legal research team, it aims to supplement existing benchmarks by focusing on tasks that mirror actual billable work lawyers perform.
- Tasks are organized into two primary categories: Transactional Task Categories and Litigation Task Categories, each with several specific task types.
- The evaluation methodology uses custom-designed rubrics that measure Answer Quality (completeness, accuracy, appropriateness) and Source Reliability (verifiable and correctly cited sources).
- Scores are calculated by combining positive points for meeting task requirements and negative points for errors or missteps, such as hallucinations (a worked example follows these notes).
- The final answer score represents the percentage of a lawyer-quality work product the LLM completes.
- A data sample is available for preview, and full access to the dataset can be obtained by contacting Harvey directly.
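As a worked illustration of the scoring arithmetic (the numbers are invented, not drawn from the benchmark): if a task rubric offers 20 positive points and a model earns 16 of them while incurring 2 penalty points for errors, its answer score would be (16 - 2) / 20 = 70% of a lawyer-quality work product.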