2023 LawBenchBenchmarkingLegalKnowle

From GM-RKB

Subject Headings: LawBench Evaluation Benchmark, Legal NLP.

Notes

Cited By

Quotes

Abstract

Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform legal-related tasks. To address this gap, we propose a comprehensive evaluation benchmark, LawBench. LawBench has been meticulously crafted to provide a precise assessment of LLMs' legal capabilities from three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize needed legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs can comprehend entities, events and relationships within legal text; (3) Legal knowledge applying: whether LLMs can properly utilize their legal knowledge and perform the necessary reasoning steps to solve realistic legal tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label classification (SLC), multi-label classification (MLC), regression, extraction and generation. We perform extensive evaluations of 51 LLMs on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs. The results show that GPT-4 remains the best-performing LLM in the legal domain, surpassing the others by a significant margin. While fine-tuning LLMs on legal-specific text brings certain improvements, we are still a long way from obtaining usable and reliable LLMs for legal tasks. All data, model predictions and evaluation code are released at this https URL. We hope this benchmark provides an in-depth understanding of LLMs' domain-specific capabilities and speeds up the development of LLMs in the legal domain.

1. Introduction

...

Cognitive Level | ID | Task | Data Source | Metric | Type
Legal Knowledge Memorization | 1-1 | Article Recitation | FLK | Rouge-L | Generation
Legal Knowledge Memorization | 1-2 | Knowledge Question Answering | JEC_QA | Accuracy | SLC
Legal Knowledge Understanding | 2-1 | Document Proofreading | CAIL2022 | F0.5 | Generation
Legal Knowledge Understanding | 2-2 | Dispute Focus Identification | LAIC2021 | F1 | MLC
Legal Knowledge Understanding | 2-3 | Marital Disputes Identification | AIStudio | F1 | MLC
Legal Knowledge Understanding | 2-4 | Issue Topic Identification | CrimeKgAssitant | Accuracy | SLC
Legal Knowledge Understanding | 2-5 | Reading Comprehension | CAIL2019 | rc-F1 | Extraction
Legal Knowledge Understanding | 2-6 | Named-Entity Recognition | CAIL2022 | soft-F1 | Extraction
Legal Knowledge Understanding | 2-7 | Opinion Summarization | CAIL2021 | Rouge-L | Generation
Legal Knowledge Understanding | 2-8 | Argument Mining | CAIL2022 | Accuracy | SLC
Legal Knowledge Understanding | 2-9 | Event Detection | LEVEN | F1 | MLC
Legal Knowledge Understanding | 2-10 | Trigger Word Extraction | LEVEN | soft-F1 | Extraction
Legal Knowledge Applying | 3-1 | Fact-based Article Prediction | CAIL2018 | F1 | MLC
Legal Knowledge Applying | 3-2 | Scene-based Article Prediction | LawGPT | Rouge-L | Generation
Legal Knowledge Applying | 3-3 | Charge Prediction | CAIL2018 | F1 | MLC
Legal Knowledge Applying | 3-4 | Prison Term Prediction w.o. Article | CAIL2018 | nLog-distance | Regression
Legal Knowledge Applying | 3-5 | Prison Term Prediction w. Article | CAIL2018 | nLog-distance | Regression
Legal Knowledge Applying | 3-6 | Case Analysis | JEC_QA | Accuracy | SLC
Legal Knowledge Applying | 3-7 | Criminal Damages Calculation | LAIC2021 | Accuracy | Regression
Legal Knowledge Applying | 3-8 | Consultation | hualv.com | Rouge-L | Generation

...

3.2 Data Source and Selected Tasks

We selected 20 tasks falling under the above-mentioned capability levels. Every task is assigned a unique task ID for better distinction. The task list is provided in Table 1. Multiple datasets can exist for the same task; when selecting the dataset for each task, we choose the most recent available version. Furthermore, certain tasks such as legal case retrieval require processing very long documents, which can surpass the length limit of most LLMs, so we do not include them in LawBench for now. When constructing LawBench, we made an effort to format the prompts in a way that best aligns with user habits, with clear instructions about the answer format, so that we can assess the ability of LLMs to assist with legal tasks in realistic scenarios.

Legal Knowledge Memorization Tasks

Legal knowledge memorization tasks examine to what extent large language models encode legal knowledge within their parameters. While this knowledge can be supplemented by external retrievers, it is still beneficial to memorize necessary legal knowledge because (1) there is currently no reliable mechanism to guarantee the accurate retrieval of legal provisions; memorizing useful knowledge within model parameters can help combat the noise from external retrievers [70; 43]; (2) it is very difficult, if not impossible, to retrieve all the legal knowledge needed for complicated reasoning tasks; the model must know basic legal concepts to connect the retrieved knowledge smoothly [50; 87; 74]; (3) relying on parametric knowledge instead of external retrievers can significantly reduce online latency [51; 53; 59].

There are two major types of legal knowledge that require memorization: (1) core law articles and regulation content, and (2) other fundamental legal concepts, notions and rules. We construct two tasks corresponding to these two types of knowledge:

  • Article recitation (1-1): Given a law article number, recite the article content. We collected the contents of laws and regulations from the national database 3 and consulted students with a legal background to select 152 sub-laws under the 5 core laws. We further incorporated updated laws and regulations, including constitutional amendments, to evaluate the model’s ability to comprehend legal changes.
  • Knowledge question answering (1-2): Given a question asking about basic legal knowledge, select the correct answer from 4 candidates. We collect knowledge-based questions from the JEC-QA tasks [85]. To simplify the process of locating answers during the test, we exclusively chose single-label questions from them.

Examples of these two tasks are in Appendix A.1.

Legal Knowledge Understanding Tasks

Legal knowledge understanding tasks examine to what extent large language models can comprehend entities, events, and relationships within legal texts. Understanding legal text is a precondition for utilizing the knowledge in concrete downstream applications [15]. In total, we selected 10 tasks corresponding to different levels of legal knowledge understanding:

  • Document Proofreading (2-1): Given a sentence extracted from legal documents, correct its spelling, grammar and ordering mistakes and return the corrected sentence. Legal documents, as the carriers of judicial authority and of citizens' exercise of legal rights, demand utmost precision in their textual content. We sample the original and corrected legal sentences from the CAIL2022 document proofreading task. Possible mistake types are inserted into the instructions so that the model can directly output the corrected sentence.
  • Dispute Focus Identification (2-2): Given the original claims and responses of the plaintiff and defendant, detect the points of dispute. In civil cases, the points of dispute represent the core of the conflict, the intersection of contradictions, and the issues over which the parties involved in the case are in contention. The automated recognition of points of contention is of practical significance and necessity for the development of the rule of law in China. Specifically, we provide the trial-related content from judgment documents, including the sections on claims and responses. The cases involve various legal matters such as civil loans, divorce, motor vehicle traffic accident liability, financial loan contracts, and more. We carefully selected common types of points of contention from LAIC2021 to construct this test set.
  • Marital Disputes Identification (2-3): Given a sentence describing marital disputes, classify it into one of the 20 pre-defined dispute types. Marital disputes refer to the various disputes arising from love, marriage, and divorce, and they are a common type of civil dispute. We selected a publicly available marriage text classification dataset on AiStudio 4. This dataset consists of 20 categories, and a single text entry may have multiple labels.
  • Issue Topic Identification (2-4): Given a user inquiry, assign it to one of the pre-defined topics. User inquiries are typically vague. Identifying the relevant topics in legal consultations can help legal professionals better pinpoint key issues. We obtain the data from the CrimeKgAssistant project 5. We keep the 20 most frequent classes and sample 25 questions for each class to form our final test set.

3 https://flk.npc.gov.cn/

4 https://aistudio.baidu.com/datasetdetail/181754
5 https://github.com/liuhuanyong/CrimeKgAssitant

  • Reading Comprehension (2-5): Given a judgement document and a corresponding question, extract relevant content from the document to answer the question. Judicial documents contain rich case information, such as time, location, and character relationships. Intelligently reading and comprehending judicial documents through large language models can assist judges, lawyers, and the general public in obtaining the necessary information quickly and conveniently. We use the CAIL2019 reading comprehension dataset to build this task, removing the binary and unanswerable question types and retaining single-segment and multi-segment data as our test set.
  • Named-Entity Recognition (2-6): Given a sentence from a judgement document, extract entity information corresponding to a set of pre-defined entity types such as suspect, victim or evidence. We sampled 500 examples from the CAIL2022 Information Extraction dataset as our test set. These 500 samples contain 10 entity types related to theft crimes.
  • Opinion Summarization (2-7): Given a legal-related public news report, generate a concise summary. Legal summaries typically include key facts of the case, points of contention, legal issues, legal principles applied, and the judgment’s outcome. It can provide a quick overview of the case content to improve the efficiency of legal professionals. We randomly select 500 samples from the CAIL2021 Legal Public Opinion Summary dataset for this task. We only select samples with less than 400 words to fit the length constraint of LLMs.
  • Argument Mining (2-8): Given a plaintiff's perspective and five candidate defendant's viewpoints, select the one viewpoint that can form a point of dispute with the plaintiff's perspective. In the court trial process, judgment documents play a crucial role in recording the arguments and evidence presented by both the plaintiff and the defendant. Due to differences in their positions and perspectives, as well as inconsistencies in their factual statements, disputes arise between the plaintiff and the defendant during the trial process. These points of contention are the key to the entire trial and the essence of judgment documents. This task aims to extract valuable arguments and supporting materials from a large volume of legal texts, providing strong support for legal debates and case analysis. We use CAIL2022's Argument Mining dataset to construct our dataset, transforming the identification of focal points of disputes into a multiple-choice question format.
  • Event Detection (2-9): Given a sentence from a legal judgement document, detect which events are mentioned in the sentence. Events are the essence of facts in legal cases; therefore, legal event detection is fundamentally important and naturally beneficial to case understanding and other legal AI tasks. We construct the test set from the LEVEN dataset [72] by sampling sentences corresponding to the 20 most frequent event types. Multiple events can be mentioned in each sentence.
  • Trigger Word Extraction (2-10): Given a sentence from a legal judgment document and its corresponding events, predict which words in the sentence triggered these events. Trigger words directly cause events and are an important feature for determining the event category, providing a post-hoc explanation for the event types we identify. Directly identifying trigger words is very difficult, so we simplified this task by providing the events contained in the text along with the text itself, examining the ability of LLMs to recognize trigger words related to events. When constructing the trigger word test set, we removed trigger words that were identical to the event type, as well as events with multiple or duplicate trigger words, from the LEVEN dataset [72], so as to include as many different trigger words as possible.

Examples of the 10 understanding tasks are in Appendix A.2.

Legal Knowledge Applying Tasks

Legal knowledge applying tasks primarily examine the ability of LLMs not only to understand legal knowledge but also to simulate law professionals in applying that knowledge to solve realistic legal tasks. In the task design, we extensively examine the model's different reasoning abilities, including 3 legal content reasoning tasks (legal judgement prediction, case analysis, and consultation) and 1 numerical reasoning task (criminal damages calculation). When predicting case judgments, judges follow a certain order when hearing a case [84; 26]. Therefore, in constructing the case judgment prediction task, we simulated this process by decomposing the CAIL2018 dataset into three tasks: fact-based article prediction (3-1), charge prediction (3-3) and prison term prediction. We further separate the task of prison term prediction into two scenarios:

without article content (3-4) and with article content (3-5), to examine LLMs' capability in utilizing the article content to make accurate judgement predictions. Besides, we also add the scene-based article prediction task (3-2) to simulate judges' recognition of legal provisions.

  • Fact-based Article Prediction (3-1): Given a fact statement from the legal judgement document, predict which article items should be applied. When judges make decisions, they usually associate relevant articles with the facts of the case [20; 41]. Article prediction can assist judges in quickly locating legal articles related to legal texts. Legal articles are written expressions of legal norms, which are rules and regulations with clear meanings and legal effects. The model needs to deduce potentially applicable legal provisions based on the given case description and related background information. We sample 500 cases from the CAIL2018 dataset for this task.
  • Scene-based Article Prediction (3-2): Given a described scenario and a related question, predict the corresponding article item. The CAIL2018 dataset only covers criminal law-related legal provisions. In order to comprehensively evaluate the ability of LLMs to analyze case facts and infer relevant legal provisions, we collected high-quality legal scenario-based question-and-answer data from public sources on GitHub [38]. This dataset was generated by inputting legal provisions into ChatGPT to construct corresponding scenario-based questions and answers. We manually selected 5,000 question-and-answer pairs with accurate answers from the generated dataset. Based on this, we selected the scenario-based question-and-answer content of 252 core legal provisions as the test dataset.
  • Charge Prediction (3-3): Given a fact statement from the legal judgement document and the applied article number, predict the cause of action (charge). The cause of action is a summary of the nature of the legal relationship involved in a litigation case, as formulated by the people's court. In the process of filing and hearing cases, accurate prediction of the cause of action can help the court allocate cases, allocate resources, and arrange trials, thereby improving judicial efficiency and fairness. We sampled 500 examples from the CAIL2018 cause of action prediction dataset for this task.
  • Prison Term Prediction w.o. Article (3-4): Given a fact statement from the legal judgement document, the applied article number and the charge, predict the prison term. Prison term prediction refers to predicting and estimating the possible sentence that a defendant may face during the criminal justice process based on the facts of the case, legal provisions, and relevant guiding precedents. It aims to make reasonable inferences about the length and form of the sentence by comprehensively considering various factors such as the nature of the crime, the circumstances of the offense, the social impact, and the defendant's personal situation. We used the prison term prediction dataset from CAIL2018, removed cases involving the death penalty and life imprisonment, and randomly sampled 500 cases as the test dataset. During sentencing, judges usually take more information into account to determine the prison term; we simulated the judge's analysis process by providing the relevant legal provisions and the charge of the case.
  • Prison Term Prediction w. Article (3-5): Given a fact statement from the legal judgement document, the applied article content and the charge, predict the prison term. Large language models typically use retrieval-augmented methods to introduce new information, and some publicly available models include retrieval modules that provide detailed reference information by retrieving legal provisions. We simulated this process: unlike the previous task, where only the legal provision number was provided, in this task we provide the specific content of the legal provision. When constructing the sentence prediction dataset, we appended the content of the legal provisions to the end of the question, allowing the model to complete the prison term prediction task in this scenario.
  • Case Analysis (3-6): Given a case and a corresponding question, select the correct answer from 4 candidates. We use the case analysis part of the JEC_QA dataset [85] for this task. The case analysis part tests the ability of models to analyze real cases. Models must possess five types of reasoning to perform this analysis: word matching, concept understanding, numerical analysis, multi-paragraph reading, and multi-hop reasoning. In order to reduce the difficulty of the test and facilitate answer extraction, we sampled 500 multiple-choice questions from the JEC_QA case analysis part as the testing dataset.
  • Criminal Damages Calculation (3-7): Given a fact description of a criminal process, predict the amount of money involved in the case. Some numerical computing tasks arise in the process of judicial trials, such as calculating the total amount of money involved in a crime. The total amount of the crime is an important sentencing factor: for some charges such as theft, financial fraud, and bribery, China's laws determine the severity of the sentence based on the amount involved in the case. This task mainly tests the computing ability of LLMs. First, we examine whether the model understands the rules of case amount calculation, and second, we examine whether the model can accurately complete numerical calculations. We selected the LAIC2021 numerical computing task to construct our dataset.
  • Consultation (3-8): Given a user consultation, generate a suitable answer. Legal consultation is a way for the public to access legal services. It can help people understand legal disputes, seek targeted advice and solutions from professional lawyers, and receive support and guidance. Some law firms and legal consulting companies also provide online legal consultation services, making it more convenient for people to obtain legal help. We collected legal consultation contents from the Hualv website 6; our dataset contains both the answers to legal consultations and the corresponding legal basis, i.e., legal articles.

Examples of the 8 applying tasks are in Appendix A.3.

3.3 Evaluation

For every task, the evaluation is done following two steps: (1) answer extraction, which extracts the answer from the model prediction, and (2) metric computation, which computes the metric score based on the question, extracted answer and the gold answer. Answer extraction is a necessary step since many LLMs often do not generate output directly comparable with gold labels [4]. We explain these two steps in detail in the following section.
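Conceptually, the per-example evaluation can be summarized as follows. This is a minimal sketch, not the released evaluation code: the function names, the task registry, and the default pass-through extractor are illustrative assumptions.

```python
# Minimal sketch of the two-step evaluation described above (illustrative, not the released code).
from typing import Any, Callable, Dict

# Hypothetical per-task hooks: an answer extractor and a metric function.
EXTRACTORS: Dict[str, Callable[[str], Any]] = {}
METRICS: Dict[str, Callable[[Any, Any], float]] = {}

def evaluate_example(task_id: str, prediction: str, gold: Any) -> float:
    # Step 1: extract the answer from the raw model output.
    # Generation tasks use the prediction as-is (no extraction step).
    extract = EXTRACTORS.get(task_id, lambda text: text)
    answer = extract(prediction)
    # Step 2: compute the task metric over the extracted answer and the gold answer.
    return METRICS[task_id](answer, gold)
```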

Answer Extraction Most of the tasks require the prediction to be in a standard format in order to be compared with the ground truth, so we define a set of task-specific rules to extract the answer from the model prediction.

  • Article Number Extraction (3-1): this type of task requires us to extract the article numbers predicted by the model. To do this, we use the delimiter “、” to separate the prediction text into chunks, and then the cn2an 7 library is used to convert the Chinese numerals to Arabic numerals within each of those chunks. Using a regular expression, we extract the converted Arabic numerals as the predicted article numbers; if more than one number appears in the same chunk, only the first number is extracted. All extracted numbers are combined to form the final set of predictions (see the sketch after this list).

  • Prison Term Extraction (3-4, 3-5): for this type of task, we need to extract the predicted prison terms from the prediction text. To begin, we use cn2an to convert all the Chinese numerals in the prediction to Arabic numerals; we then extract digits that are followed by Chinese time units such as “个月” (month) and “年” (year). The extracted prison terms are normalized to months, meaning that numbers appearing before “年” are multiplied by 12. Note that the time unit in the ground-truth answer is also months.

  • Criminal Damages Extraction (3-7): We extract all numbers appearing in the prediction text using a regular expression. The set of extracted numbers is considered the predicted criminal damages.
  • Named-Entity Recognition (2-6): We find all occurrences of entity types in the model prediction, obtain the substring from each occurrence to the delimiter “\n”, and then apply a regular expression to extract the entity value.
  • Trigger Word Extraction (2-10): We split the model prediction by the delimiter “;”, then treat the resulting array as the list of extracted keywords.
  • Option Extraction (1-2, 2-2, 2-3, 2-4, 2-8, 2-9, 3-3): this type of task is similar to selecting the correct options in a multiple-choice question. We run through all possible options and check whether they appear in the prediction text. The set of options that occur in the prediction text is collected and used for evaluation.

6 www.66law.com

7 https://github.com/Ailln/cn2an

  • Others (1-1, 2-1, 2-5, 2-7, 3-2, 3-8): we take the model prediction as the answer without performing any extraction step.
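The extraction rules above can be illustrated with the following sketch. It is a hedged approximation, not the released code: the regular expressions, the summing of years and months, and the default option letters are assumptions, and cn2an.transform is assumed to convert Chinese numerals embedded in free text, as documented by the cn2an project.

```python
# Hedged sketch of three extraction rules (article numbers, prison terms, options).
# Regexes and helper names are illustrative; behavior may differ from the official scripts.
import re
import cn2an

def extract_article_numbers(prediction: str) -> set:
    """Task 3-1: split on '、', convert Chinese numerals, keep the first number in each chunk."""
    numbers = set()
    for chunk in prediction.split("、"):
        chunk = cn2an.transform(chunk, "cn2an")  # e.g. "第二百六十四条" -> "第264条"
        match = re.search(r"\d+", chunk)
        if match:
            numbers.add(int(match.group()))
    return numbers

def extract_prison_term_months(prediction: str):
    """Tasks 3-4/3-5: convert Chinese numerals, then normalize '年' (x12) and '个月' to months.
    One plausible reading: if both units appear, their contributions are summed."""
    text = cn2an.transform(prediction, "cn2an")
    years = re.search(r"(\d+)\s*年", text)
    months = re.search(r"(\d+)\s*个月", text)
    if not years and not months:
        return None  # no extractable term -> counted as an abstention
    return (int(years.group(1)) * 12 if years else 0) + (int(months.group(1)) if months else 0)

def extract_options(prediction: str, options=("A", "B", "C", "D")) -> set:
    """Option tasks: collect every candidate option that appears in the prediction text.
    The default letters are illustrative; the real option sets are task-specific."""
    return {opt for opt in options if opt in prediction}
```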

Metrics After the answer extraction phase, we compute the final metric based on the extracted answer. We defined 7 different metrics in total to measure different types of tasks:

  • Accuracy: Accuracy is a binary score that performs exact match between the model prediction and the gold answer. This applies to all single-label classification tasks including task 1-2, 2-4, 2-8, 3-6, and the regression task 3-7. For SLC tasks, if multiple valid answers are extracted from the model prediction, then we always treat it as wrong 8.
  • F1: When there are multiple output labels, F1 score measures the harmonic mean of the precision and recall. This applies to all multi-label classification tasks including task 2-2, 2-3, 2-9, 3-1 and 3-3.
  • rc-F1: rc-F1 is the F1 score tailored for the reading comprehension task 2-5. It treats every token as a label, removes punctuation and extra whitespace, performs other necessary normalizations, and then computes the F1 score. We adopt the official script from CAIL2019 to compute the instance-level rc-F1 score 9.
  • soft-F1: For extraction tasks 2-6 and 2-10, the output is a set of phrases. Instead of using the standard F1 score, we use a soft version by replacing the phrase-level exact match with the rc-F1 score, then computing the F1 on top of it. We find using the soft version helpful since LLMs often use wording choices different from the ground truth.
  • nLog-distance: For the prison term prediction tasks 3-4 and 3-5, we evaluate with the normalized log distance (nLog-distance) to capture the continuity of prison terms. We compute the logarithm of the difference between the extracted and gold answers, then normalize it to the range [0, 1] for better compatibility with other metrics (one possible formulation is sketched after this list).
  • F0.5: For the document proofreading task 2-1, we use the F0.5 metric. The F0.5 score gives more weight to precision than to recall, since in proofreading we want to avoid introducing false positives more than we want to identify every single error [81]. We use the ChERRANT toolkit to align the extracted and gold answers before computing the F0.5 score 10. As the alignment can take too long for very bad generations, we add a time-out of 10 seconds; if a time-out happens, the prediction is assigned a score of 0.
  • Rouge-L: For the other generation tasks 1-1, 2-7, 3-2 and 3-8, we use the Rouge-L score. Rouge-L is a commonly used metric in generation tasks: it naturally takes sentence-level structural similarity into account and automatically identifies the longest co-occurring n-gram sequence between the extracted and gold answers [37].
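The description of nLog-distance above does not spell out the exact normalization, so the following is only one plausible formulation consistent with it: take the log of the absolute difference between the extracted and gold prison terms (in months) and map it into [0, 1], with an exact match scoring 1. The cap max_log is a hypothetical constant, not a value from the paper.

```python
import math

def nlog_distance(pred_months: float, gold_months: float,
                  max_log: float = math.log(1 + 600)) -> float:
    """One plausible reading of nLog-distance: log(1 + |pred - gold|), rescaled so that
    an exact match scores 1.0 and large errors approach 0. The cap max_log
    (here log(1 + 600), i.e. a 50-year span) is an assumed normalization constant."""
    dist = math.log(1 + abs(pred_months - gold_months))
    return max(0.0, 1.0 - dist / max_log)
```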

Several large language models may decline to respond to legal-related inquiries due to security policies, or may simply fail to follow the instructions. To capture this issue, we also report the abstention rate of each LLM on each task (how often the LLM abstains from answering). An abstention happens when no answer can be extracted from the model prediction. The abstention rate does not apply to task 2-5 and the generation tasks, since they do not need the answer extraction step.
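Under the definition above, the abstention rate can be tallied with a few lines. This is a minimal sketch: the extractor argument stands for the hypothetical per-task extraction functions sketched earlier, and a falsy return value (None, empty set, or empty string) is taken to mean that no answer could be extracted.

```python
from typing import Any, Callable, Iterable

def abstention_rate(predictions: Iterable[str], extractor: Callable[[str], Any]) -> float:
    """Fraction of model predictions from which no answer can be extracted."""
    predictions = list(predictions)
    if not predictions:
        return 0.0
    abstained = sum(1 for p in predictions if not extractor(p))
    return abstained / len(predictions)
```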



References

Kai Chen, Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Zongwen Shen, and Jidong Ge. "LawBench: Benchmarking Legal Knowledge of Large Language Models." doi:10.48550/arXiv.2309.16289. 2023.