ContractEval Benchmark
A ContractEval Benchmark is an open source legal domain LLM evaluation benchmark, developed by Carnegie Mellon University researchers, that assesses clause-level legal risk identification performance.
- AKA: Contract Evaluation Benchmark, ContractEval.
- Context:
- It can typically evaluate Large Language Models on clause-level legal risk identification tasks using the CUAD test dataset.
- It can typically measure LLM performance across correctness measures, output effectiveness measures, and false response rates.
- It can typically assess both proprietary LLMs and open source LLMs for contract review tasks.
- It can typically support data confidentiality preservation through local model deployment (see the local inference sketch after this list).
- It can typically identify legal risk categories across 41 clause types.
- ...
- It can often evaluate model reasoning strategies, including thinking mode and non-thinking mode.
- It can often measure model quantization effects on legal task performance.
- It can often detect model laziness patterns through false no-related-clause responses.
- It can often benchmark category-specific performance across legal clause types.
- ...
- It can range from being a Simple ContractEval Benchmark to being a Comprehensive ContractEval Benchmark, depending on its contracteval benchmark evaluation scope.
- It can range from being a Basic ContractEval Benchmark to being an Advanced ContractEval Benchmark, depending on its contracteval benchmark metric complexity.
- ...
- It can utilize F1 Score Metrics for contracteval benchmark correctness evaluation.
- It can employ Jaccard Similarity Coefficients for contracteval benchmark output effectiveness (a scoring sketch combining these measures appears after this list).
- It can integrate with CUAD Datasets containing 4,128 data points.
- It can process Legal Contracts of up to 301k characters in length.
- It can evaluate 19 State-of-the-Art LLMs including 4 proprietary models and 15 open source models.
- ...
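The bullets above describe evaluating locally deployed open source LLMs so that contract text never leaves the machine. A minimal Python sketch of that setup, assuming the Hugging Face transformers library and an illustrative model name and prompt wording (not the paper's exact harness or model list), might look like:

```python
# Minimal sketch (not the authors' harness): querying a locally deployed
# open source LLM with one clause-level question, so the contract text
# stays on-premise. The model name and prompt wording are assumptions.

from transformers import pipeline

# Any locally hosted instruction-tuned model can be substituted here.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def ask_clause_question(contract_text: str, category: str) -> str:
    """Ask the model to extract the exact clause text for one CUAD category."""
    prompt = (
        "You are reviewing a commercial contract.\n"
        f"Category: {category}\n"
        "Quote the exact sentence(s) from the contract that relate to this "
        "category, or answer 'No related clause' if none exist.\n\n"
        f"Contract:\n{contract_text}"
    )
    output = generator(prompt, max_new_tokens=256, return_full_text=False)
    return output[0]["generated_text"].strip()
```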
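The correctness, output effectiveness, and false response measures listed above can be combined in a single scoring pass. The sketch below is one plausible reading of those measures (exact-substring matching feeding precision/recall/F1, token-level Jaccard similarity for output effectiveness, and a false "no related clause" rate as a laziness signal); the data structure, field names, and matching rules are illustrative assumptions, not the released ContractEval implementation.

```python
# Illustrative clause-level scoring, assuming exact-substring matches count
# as correct extractions and Jaccard similarity is computed over token sets.

from dataclasses import dataclass

@dataclass
class ClauseExample:
    category: str               # one of the 41 CUAD clause categories
    gold_spans: list[str]       # annotated clause spans (empty if the category does not apply)
    predicted_spans: list[str]  # spans extracted by the LLM ([] if it answered "no related clause")

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two spans."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def evaluate(examples: list[ClauseExample]) -> dict[str, float]:
    tp = fp = fn = 0
    jaccard_scores = []
    false_no_clause = 0  # "laziness": model claims no clause although one exists

    for ex in examples:
        if ex.gold_spans and not ex.predicted_spans:
            false_no_clause += 1
        for pred in ex.predicted_spans:
            # An exact-substring match against any gold span counts as a true positive.
            if any(pred.strip() == g.strip() for g in ex.gold_spans):
                tp += 1
            else:
                fp += 1
            jaccard_scores.append(max((token_jaccard(pred, g) for g in ex.gold_spans), default=0.0))
        # Gold spans the model never reproduced are false negatives.
        fn += sum(1 for g in ex.gold_spans
                  if not any(p.strip() == g.strip() for p in ex.predicted_spans))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mean_jaccard": sum(jaccard_scores) / len(jaccard_scores) if jaccard_scores else 0.0,
        "false_no_clause_rate": false_no_clause / len(examples) if examples else 0.0,
    }

# Tiny usage example with a single correctly extracted clause.
example = ClauseExample(
    category="Governing Law",
    gold_spans=["This Agreement shall be governed by the laws of Delaware."],
    predicted_spans=["This Agreement shall be governed by the laws of Delaware."],
)
print(evaluate([example]))  # precision, recall, f1, and mean_jaccard all 1.0
```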
- Example(s):
- ...
- Counter-Example(s):
- General Legal Benchmarks, which lack clause-level extraction capability.
- Document-Level Legal Benchmarks, which lack span-level prediction requirement.
- Legal Case Retrieval Benchmarks, which lack exact substring extraction.
- Legal Text Generation Benchmarks, which lack classification task focus.
- See: CUAD Dataset, Legal Contract Review Task, LLM Evaluation Benchmark, Clause-Level Risk Identification Task, Legal Domain Benchmark.
References
2025
- (Liu et al., 2025) ⇒ [[::Shuang Liu]], [[::Zelong Li]], [[::Ruoyun Ma]], [[::Haiyan Zhao]], and [[::Mengnan Du]]. ([[::2025]]). “ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts.” https://www.arxiv.org/abs/2508.03080