LegalRikai Contract NLP Benchmark Dataset
A LegalRikai Contract NLP Benchmark Dataset is an open-source Japanese legal NLP benchmark dataset that evaluates NLP models on Japanese legal contract analysis tasks.
- AKA: LegalRikai Dataset, LegalRikai Benchmark, LegalRikai 2 Dataset.
- Context:
- It can typically evaluate LegalRikai Contract NLP Benchmark Dataset Model Performance through legalrikai contract nlp benchmark dataset task metrics like F1-score, precision, and recall.
- It can typically include LegalRikai Contract NLP Benchmark Dataset Component Tasks such as legalrikai contract nlp benchmark dataset clause classification, legalrikai contract nlp benchmark dataset clause revision, and legalrikai contract nlp benchmark dataset clause identification.
- It can typically support LegalRikai Contract NLP Benchmark Dataset Zero-Shot Evaluation for baseline model capabilities in legalrikai contract nlp benchmark dataset prompting settings.
- It can typically support LegalRikai Contract NLP Benchmark Dataset Few-Shot Evaluation with limited examples like 8-shot for revision or 10-shot for identification.
- It can often measure LegalRikai Contract NLP Benchmark Dataset LLM Performance on generative models like GPT-4, Claude, and Gemini variants.
- It can often assess LegalRikai Contract NLP Benchmark Dataset Revision Quality via legalrikai contract nlp benchmark dataset LLM-based criteria such as alert addressing, meaning preservation, grammar, naturalness, and logical flow.
- It can often highlight LegalRikai Contract NLP Benchmark Dataset Performance Gaps where even the best models reach only low-to-mid-40s F1 in classification and retrieval and score lower on substantive revision criteria than on surface fluency.
- It can range from being a Classification-Focused LegalRikai Contract NLP Benchmark Dataset to being a Retrieval-Focused LegalRikai Contract NLP Benchmark Dataset, depending on its legalrikai contract nlp benchmark dataset task emphasis.
- It can range from being a Zero-Shot LegalRikai Contract NLP Benchmark Dataset to being a Few-Shot LegalRikai Contract NLP Benchmark Dataset, depending on its legalrikai contract nlp benchmark dataset prompting configuration.
- It can provide LegalRikai Contract NLP Benchmark Dataset Resources for legalrikai contract nlp benchmark dataset research in Japanese legal tech.
- It can incorporate LegalRikai Contract NLP Benchmark Dataset Expert Annotations from legal professionals for gold-standard labels and revisions.
- It can support LegalRikai Contract NLP Benchmark Dataset Model Comparison across families like OpenAI GPT-4 variants, Anthropic Claude, and Google Gemini.
- It can emphasize LegalRikai Contract NLP Benchmark Dataset Domain Specificity for realistic Japanese contract review workflows.
- ...
- Examples:
- LegalRikai Contract NLP Benchmark Dataset Clause Classification Subsets, such as:
- LegalRikai Contract NLP Benchmark Dataset Development Clause Classification Set with 46,477 clauses for training or few-shot examples.
- LegalRikai Contract NLP Benchmark Dataset Evaluation Clause Classification Set containing 12,876 clauses and 492 unique legal topic labels.
- LegalRikai Contract NLP Benchmark Dataset Multi-Label Clause Classification Subset for evaluating over-prediction tendencies in LLMs.
- LegalRikai Contract NLP Benchmark Dataset Clause Revision Subsets, such as:
- LegalRikai Contract NLP Benchmark Dataset Revision Test Set with 378 clauses and expert-written references.
- LegalRikai Contract NLP Benchmark Dataset Alert-Based Revision Subset for simulating legal issue resolution with guidance.
- LegalRikai Contract NLP Benchmark Dataset Few-Shot Revision Example Set containing 8 clauses for prompting demonstrations.
- LegalRikai Contract NLP Benchmark Dataset Clause Identification Subsets, such as:
- LegalRikai Contract NLP Benchmark Dataset Identification Test Set with 3,200 query-contract pairs.
- LegalRikai Contract NLP Benchmark Dataset Query-Based Identification Subset for retrieval of relevant clauses in review queries.
- LegalRikai Contract NLP Benchmark Dataset Few-Shot Identification Example Set containing 200 pairs for prompting.
- LegalRikai Contract NLP Benchmark Dataset Model Evaluation Results, such as:
- LegalRikai Contract NLP Benchmark Dataset Claude 3.7 Sonnet Result achieving ~41 F1 in zero-shot classification.
- LegalRikai Contract NLP Benchmark Dataset GPT-4.1 Result achieving ~47.6 F1 in few-shot identification.
- ...
- Counter-Examples:
- General NLP Benchmark Datasets, which lack Japanese legal focus rather than legalrikai contract nlp benchmark dataset domain specificity.
- English Legal Benchmark Datasets, which use English texts rather than legalrikai contract nlp benchmark dataset Japanese language.
- Non-Contract Legal Datasets, which focus on statutes or judgments rather than legalrikai contract nlp benchmark dataset contract analysis.
- Non-Legal Japanese NLP Datasets, which omit legal tasks rather than legalrikai contract nlp benchmark dataset legal domain emphasis.
- See: Japanese Legal NLP Benchmark Task, Legal Clause Classification Task, Legal Clause Revision Task, Legal Clause Identification Task, LLM-Based Legal Revision Evaluation Task, LexGLUE Benchmark, Benchmark Dataset, Contract-Related NLP Task, Legal Text Analysis Task.
References
2025-07-18
- Deep Research https://chatgpt.com/s/dr_687aa15754608191a3a56a8cea8893c1
- Question: Analysis of LegalRikai 2, a Japanese Legal NLP benchmark dataset for evaluating NLP models on Japanese legal document tasks
- Answer: LegalRikai is a comprehensive benchmark dataset for Japanese legal language processing, developed by LegalOn Technologies (formerly LegalForce). It is designed to cover a range of tasks specific to the Japanese legal domain, particularly focusing on contract documents.
- Introduction: Purpose and Scope of LegalRikai
- Goal: The goal of LegalRikai is to fill important gaps in existing NLP benchmarks for Japanese legal text, providing a standardized way to evaluate and compare NLP models on realistic legal tasks.
- Multiple Task Benchmarks: By offering multiple task benchmarks (spanning classification, text revision, and information retrieval in a legal context), LegalRikai serves both research and industry as a resource to measure progress in legal AI and identify where current models succeed or fall short in understanding Japanese legal language.
- Benchmark Tasks in LegalRikai
- Three Main Tasks: LegalRikai currently consists of three main tasks, each targeting a different aspect of legal document analysis in Japanese contracts:
- Clause Classification: Identify and label a given contract clause with one or more predefined legal topic categories. This is a multi-label text classification task where a clause may pertain to multiple legal concepts (e.g. a clause might be labeled as both "Subcontracting" and "Prohibition of Assignment" if it forbids subcontracting and assignment without consent).
- Dataset Size: The dataset for this task contains 46,477 clauses in a development set (for training or few-shot examples) and 12,876 clauses in an evaluation set, with a total of 492 unique labels spanning various legal clause types.
- Evaluation: The task is evaluated with standard multi-label metrics (precision, recall, F1), and for generative models that output labels in text form, an exact match to the correct label wording is required to count as a correct prediction.
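The exact-match, multi-label scoring described above is straightforward to reproduce. Below is a minimal sketch, assuming gold and predicted labels are available as per-clause sets of label strings; the function name, the micro-averaging choice, and the illustrative label strings are assumptions, not details taken from the released benchmark.

```python
# Minimal sketch of exact-match multi-label scoring (micro-averaged).
# Data loading is omitted; only the metric logic is shown.

def micro_prf(gold: list[set[str]], pred: list[set[str]]) -> tuple[float, float, float]:
    """Micro-averaged precision/recall/F1 over per-clause label sets.

    A predicted label counts as correct only if it exactly matches the
    gold label string, mirroring the exact-match rule applied to
    generative models' text output.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: one clause with two gold labels; the model over-predicts a third.
gold = [{"下請負", "譲渡禁止"}]
pred = [{"下請負", "譲渡禁止", "秘密保持"}]
print(micro_prf(gold, pred))  # ≈ (0.667, 1.0, 0.8): high recall, lower precision
```

The toy example deliberately shows the high-recall/low-precision pattern that the results below report for this task.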
- Clause Revision: Given a contract clause along with an alert describing a potential legal issue in that clause and guidance on how to fix it, the model must produce a revised version of the clause that addresses the issue.
- Example: If a clause says "If the client hires a candidate, they must notify the agency promptly" and the alert points out that it doesn't specify "in writing", the model should add language like "notify in writing" as guided.
- Dataset Creation: The dataset was created by repurposing the company's contract review module output: lawyers on the team manually crafted the "ideal" revisions for 378 clauses (test set) and a few-shot example set of 8 clauses (dev) to serve as references.
- Evaluation Methodology: The evaluation uses a strong LLM (GPT-4) to compare the model's revised clause against the human-written reference across multiple dimensions (a minimal judge sketch follows the criteria list):
- Addressing the Alert: Does the model's revision effectively address the alerted issue and match the lawyer's correction?
- Meaning Preservation: Does the revised clause retain the intent and legal nuance of the reference solution written by the lawyer?
- Grammar: Is the revision grammatically correct to the same standard as the reference clause?
- Naturalness: Is the revised clause written in natural, fluent Japanese, on par with the reference?
- Logical Flow: Does the clause maintain a logical structure and flow, reflecting the same logical order as the reference correction?
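As noted above, this judge setup can be sketched as one scoring prompt per criterion sent to a strong LLM. The sketch below is illustrative only: `call_llm` is a stand-in for whatever GPT-4 client is used, and the prompt wording is an assumption, since the benchmark's actual judge prompts are not reproduced here.

```python
# Illustrative LLM-as-judge sketch for the five revision criteria.
# `call_llm` is a hypothetical function that sends a prompt to a strong
# LLM (e.g. GPT-4) and returns its text completion.

CRITERIA = [
    "Addressing the alert",
    "Meaning preservation",
    "Grammar",
    "Naturalness",
    "Logical flow",
]

JUDGE_TEMPLATE = """You are evaluating a revised contract clause.
Original clause: {original}
Alert (issue to fix): {alert}
Reference revision (written by a lawyer): {reference}
Model revision: {candidate}

For the criterion "{criterion}", score the model revision from 0 to 100
against the reference. Answer with the number only."""

def judge_revision(original, alert, reference, candidate, call_llm):
    """Return one 0-100 score per criterion, as judged by the LLM."""
    scores = {}
    for criterion in CRITERIA:
        prompt = JUDGE_TEMPLATE.format(
            original=original, alert=alert, reference=reference,
            candidate=candidate, criterion=criterion,
        )
        scores[criterion] = float(call_llm(prompt))  # assumes a bare numeric reply
    return scores
```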
- Clause Identification for Review: Given a contract document and a specific review query or item (e.g. "Does the contract include a clause about the contractor's obligation to return materials provided by the client?"), the model must identify which clause(s) in the contract address that item.
- Task Nature: This is a targeted information retrieval or clause-finding task: the output is the clause number(s) that mention the topic in question.
- Dataset: The dataset includes 3,200 question-contract pairs in the test set and 200 in the dev set (for few-shot use).
- Evaluation: Evaluation is done like a multi-label classification: each relevant clause is a "label". Standard precision and recall are computed by comparing the set of clauses the model outputs to the set of correct clauses identified by legal experts.
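This set-based scoring can be sketched in a few lines, assuming each model answer arrives as free text containing clause numbers (e.g. "3, 7"); the regex parsing and function names below are illustrative, not part of the benchmark's released tooling.

```python
# Sketch of clause-identification scoring: each relevant clause is a "label".
import re

def parse_clause_numbers(answer: str) -> set[int]:
    """Extract clause numbers from a free-text model answer."""
    return {int(n) for n in re.findall(r"\d+", answer)}

def identification_prf(gold: list[set[int]], answers: list[str]) -> tuple[float, float, float]:
    """Micro-averaged precision/recall/F1 over predicted clause-number sets."""
    pred = [parse_clause_numbers(a) for a in answers]
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One query whose gold clauses are 3 and 7; the model returns only clause 3.
print(identification_prf([{3, 7}], ["第3条"]))  # ≈ (1.0, 0.5, 0.667): precise but low recall
```

The toy output matches the conservative-retrieval pattern reported below: what the model returns is correct, but it misses relevant clauses.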
- Methodology and Models Evaluated
- LLM Evaluation: The benchmark primarily evaluates Large Language Models (LLMs) under zero-shot and few-shot prompting settings (no fine-tuning on the tasks, but providing examples in the prompt for few-shot).
- Tested Models: A variety of state-of-the-art models were tested:
- OpenAI GPT-4 variants: GPT-4o and GPT-4.1, along with the smaller-scale GPT-4.1-mini and GPT-4.1-nano.
- Anthropic Claude models: Claude 3.5 and Claude 3.7, each in sub-variants named Haiku and Sonnet.
- Google Gemini models: Gemini 1.5 and Gemini 2.0, with versions "Flash" and "Pro" tested.
- Evaluation Settings: All models were evaluated zero-shot on all tasks; for the two generation-heavy tasks (Clause Revision and Clause Identification), a few-shot setting was also evaluated (8-shot for revision, 10-shot for identification).
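As a rough illustration of these settings, a few-shot prompt can be assembled by prepending worked examples to the test input; the instruction wording and Input/Output format below are assumptions, not the benchmark's actual prompts.

```python
# Minimal sketch of zero-shot vs. few-shot prompt assembly.

def build_prompt(instruction: str,
                 examples: list[tuple[str, str]],
                 query: str) -> str:
    """Concatenate a task instruction, k worked examples, and the test input."""
    parts = [instruction, ""]
    for model_input, gold_output in examples:
        parts += [f"Input: {model_input}", f"Output: {gold_output}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)

# Zero-shot is the same call with an empty example list; 8-shot revision or
# 10-shot identification just passes 8 or 10 (input, output) pairs, e.g.:
# prompt = build_prompt(instruction, dev_examples[:8], test_clause)
```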
- Performance of NLP Models on LegalRikai Tasks
- Clause Classification Results:
- Overall Performance: In the clause classification task, none of the models achieved very high scores – the best model in zero-shot was Claude 3.7 Sonnet, which achieved an F1 around 41 (out of 100).
- Precision vs. Recall: Recall was consistently higher than precision for almost all models, suggesting the models tend to over-predict labels, outputting more labels per clause than necessary.
- Model Size Impact: Larger and newer models outperformed smaller ones within the same family, showing the benefits of scale; for example, Claude 3.7 Sonnet scored about 0.8 F1 points higher than Claude 3.5 Sonnet.
- Invalid Label Generation: Claude models were particularly good at avoiding invalid labels compared to others, with Claude 3.5 Sonnet generating even fewer out-of-set labels than Claude 3.7 Sonnet.
- Clause Revision Results:
- Language Fluency: Grammar and fluency (naturalness) were handled quite well by most models – nearly all models scored around 85–90 out of 100 in grammar, and around the high 60s to low 70s in naturalness.
- Legal Reasoning: The more substantive aspects – namely "addressing the alert" correctly, preserving the precise intended meaning, and maintaining logical flow – saw much lower scores.
- Top Performers: Claude 3.5/3.7 Sonnet variants and GPT-4.1 were top performers, with Claude scoring ~59–60 on alert handling and ~55–56 on logical flow.
- Few-Shot Benefits: Providing few-shot examples significantly helped with the more complex aspects, with meaning preservation and logical flow scores improving markedly.
- Clause Identification Results:
- Overall Performance: The best model was GPT-4.1, which achieved about 42.1 F1 (with ~60.9% precision and ~32.2% recall) in the zero-shot evaluation.
- Conservative Retrieval: Precision was higher than recall, meaning the models retrieve conservatively – they often miss clauses that should be returned, but the ones they do return are usually correct.
- Few-Shot Enhancement: Using few-shot examples (10-shot) improved performance for most models, with GPT-4.1's F1 rising to about 47.6.
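- F1 Consistency Check: the zero-shot figures above are internally consistent, since F1 = 2PR/(P + R) = 2 × 60.9 × 32.2 / (60.9 + 32.2) ≈ 42.1.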
- Key Takeaways from the Benchmark Results
- Uneven Performance: Current LLMs show uneven performance across different legal tasks: They perform fairly well on surface-level language tasks but struggle with deeper legal reasoning or comprehensive document analysis.
- Task-Specific Trade-offs: Precision vs. Recall trade-offs vary by task: In clause classification, models tended to overshoot (high recall, low precision), while in clause retrieval, the opposite happened (high precision, low recall).
- Scaling Benefits: Scaling up model size generally improves performance within each model family, with larger models achieving higher accuracy and making fewer obvious mistakes.
- Task-Dependent Rankings: Model rankings can flip depending on task: Claude's models were particularly strong in clause classification, but GPT-4.1 took the lead in the clause identification task.
- Few-Shot Benefits: Few-shot prompting helps with complex outputs, notably improving performance on tasks requiring more reasoning.
- Improvement Headroom: Significant headroom remains for improvement, with even the best models achieving F1 scores in the 40s and 50s (out of 100) for core tasks.
- Significance of LegalRikai in Japanese Legal NLP
- Domain-Specific Benchmark: LegalRikai represents a significant development for Japanese legal NLP, as it is a rare example of a benchmark tailored to the complexities of Japanese legal language and documents.
- Multi-Faceted Evaluation: The inclusion of multiple tasks acknowledges that legal understanding is multi-faceted, testing classification knowledge of legal concepts, the ability to carry out legal revisions, and the skill of extracting information from a contract.
- Industry Applicability: From the industry perspective, LegalRikai's significance lies in its realism and direct applicability to legal tech products, mirroring real workflows in contract review.
- Progress Measurement: LegalRikai can serve as a benchmark for clients and stakeholders to understand the capabilities of legal AI, providing concrete, measurable evidence of model performance.
- Conclusion
- Comprehensive Assessment: The LegalRikai benchmark provides a comprehensive look at how modern NLP models perform on Japanese legal tasks, covering clause classification, automated clause revision, and clause retrieval within contracts.
- Incremental Progress: The key findings show incremental progress but also highlight that truly reliable legal NLP is still an open challenge – especially for tasks requiring precise legal reasoning and exhaustive understanding of a contract.
- Future Impact: With continued updates to LegalRikai and inclusion of newer models, it is expected to play a pivotal role in driving forward the frontier of Japanese legal NLP, ultimately contributing to AI systems that can more adeptly support legal professionals and possibly transform how legal services are delivered.
2025-07
- [1] Fujita, LegalOn Tech Blog – "日本法務分野のためのNLPベンチマークデータセットLegalRikaiの追加実験報告" (Report on additional experiments with LegalRikai, an NLP benchmark dataset for the Japanese legal field) - https://tech.legalforce.co.jp/entry/nlp-benchmark-dataset-legalrikai-2