Stanford LegalBench Benchmark
A Stanford LegalBench Benchmark is an open-source collaborative legal reasoning benchmark that can support legal AI evaluation tasks.
- AKA: Collaborative Legal Reasoning Benchmark.
- Context:
- It can typically evaluate Legal Reasoning Capability through IRAC-based legal tasks and non-IRAC legal tasks covering six legal reasoning types.
- It can typically include Legal Task Variety simulating real-world legal scenarios requiring legal principle application and legal reasoning.
- It can typically assess LLM Legal Performance on lawyer-associated tasks including legal issue identification, legal rule application, and legal conclusion drawing.
- It can typically provide Cross-Disciplinary Vocabulary bridging legal professionals and LLM developers through common legal frameworks.
- It can typically enable Legal AI Research through 162 legal tasks crafted by legal subject matter experts.
- ...
- It can often organize Legal Task Categorys using IRAC (Issue, Rule, Application, Conclusion) method as legal analysis framework.
- It can often include Non-IRAC Legal Tasks such as client counseling, contract analysis, and legal negotiation.
- It can often facilitate Ongoing Legal Contributions from legal professionals and AI researchers reflecting dynamic legal field changes.
- It can often measure Practical Legal Utility through lawyer-designed tasks representing legal reasoning skills that lawyers find practically useful or interesting.
- It can often support Empirical Legal AI Evaluation of open-source LLMs and commercial LLMs on legal reasoning tasks.
- ...
- It can range from being a Simple Legal Classification Benchmark to being a Complex Legal Reasoning Benchmark, depending on its legal task complexity.
- It can range from being a Single Legal Domain Benchmark to being a Comprehensive Legal Domain Benchmark, depending on its legal coverage scope.
- It can range from being an Academic Legal Benchmark to being a Practical Legal Application Benchmark, depending on its legal use case focus.
- ...
- It can employ IRAC Framework Categorization for legal task organization across issue tasks, rule tasks, application tasks, and conclusion tasks.
- It can utilize Collaborative Construction Process involving interdisciplinary teams of computer scientists and lawyers.
- It can provide Legal Task Datasets with input-output pairs for LLM evaluation (see the sketch after this list).
- It can enable Legal Reasoning Type Analysis across six distinct legal reasoning categorys (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, and rhetorical-understanding).
- It can support Legal Community Contributions through open science frameworks for benchmark expansion.
- It can facilitate Legal AI Safety Assessment for legal workflow integration.
- It can measure Legal Task Completion Frequency evaluating how often LLMs generate desired legal outputs.
- ...
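The input-output pair structure and the task-completion-frequency measurement described above can be illustrated with a minimal Python sketch. The Hugging Face dataset id, task config name, split names, and column names used below are assumptions rather than confirmed details of the official LegalBench release, and the placeholder model simply stands in for any LLM completion call.
```python
from datasets import load_dataset  # Hugging Face `datasets` library

# Illustrative sketch only: the dataset id ("nguha/legalbench"), the task
# config name ("abercrombie"), the split names, and the "text"/"answer"
# column names are assumptions, not confirmed details of the release.
task = load_dataset("nguha/legalbench", "abercrombie")

def placeholder_model(prompt: str) -> str:
    """Stand-in for an LLM call; always returns one label, for illustration only."""
    return "generic"

def task_completion_frequency(predictions, references):
    """Fraction of model outputs that exactly match the desired legal output."""
    matches = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return matches / len(references) if references else 0.0

test_rows = task["test"]
references = [row["answer"] for row in test_rows]
predictions = [placeholder_model(row["text"]) for row in test_rows]
print(f"Task completion frequency: {task_completion_frequency(predictions, references):.2%}")
```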
- Example(s):
- LegalBench Task Familys, such as:
- CUAD (Classification) Tasks demonstrating legal clause classification capability.
- Rule QA Tasks demonstrating legal rule knowledge.
- Abercrombie Tasks demonstrating legal application and conclusion.
- Hearsay Tasks demonstrating legal evidence analysis.
- Diversity Jurisdiction Tasks demonstrating legal procedural analysis.
- Personal Jurisdiction Tasks demonstrating legal authority analysis.
- PROA (Private Right of Action) Tasks demonstrating legal statutory interpretation.
- Intra-Rule Distinguishing Tasks demonstrating legal issue spotting.
- LegalBench Performance Results (2023), such as:
- GPT-4 Model (2023), evaluated on 162 legal tasks.
- Claude Model (2023), tested on IRAC-based legal reasoning.
- Open-Source LLMs, including LLaMA variants, evaluated on legal benchmark tasks.
- LegalBench Contributions, such as:
- ...
- Counter-Example(s):
- BigLaw Bench, which uses billable time entry tasks rather than academic legal reasoning tasks.
- LexGLUE, which focuses on legal language understanding classification tasks rather than expert-crafted legal reasoning tasks.
- LawBench, which emphasizes Chinese legal system tasks rather than common law reasoning.
- CUAD Benchmark, which is limited to contract understanding rather than comprehensive legal reasoning.
- See: Legal Reasoning, Large Language Model, IRAC Method, Legal AI Benchmark, Open Science Initiative, Stanford HAI.
References
2023
- (Guha et al., 2023) ⇒ Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, et al. (2023). “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models.” In: arXiv preprint arXiv:2308.11462. doi:10.48550/arXiv.2308.11462.
- ABSTRACT: The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.