2023 Bringing Order Into the Realm of Transformer-Based Language Models for Artificial Intelligence and Law

From GM-RKB

Subject Headings: Legal NLP Benchmark.

Notes

Cited By

Quotes

Abstract

Transformer-based language models (TLMs) have widely been recognized to be a cutting-edge technology for the successful development of deep-learning-based solutions to problems and applications that require natural language processing and understanding. As for other textual domains, TLMs have indeed pushed the state-of-the-art of AI approaches for many tasks of interest in the legal domain. Although the first Transformer model was proposed only about six years ago, this technology has progressed at an unprecedented rate, with BERT and related models representing a major reference, also in the legal domain. This article provides the first systematic overview of TLM-based methods for AI-driven problems and tasks in the legal sphere. A major goal is to highlight research advances in this field so as to understand, on the one hand, how Transformers have contributed to the success of AI in supporting legal processes, and, on the other hand, what the current limitations and opportunities for further research development are.

1 Introduction

2 Background on Transformer-based Language Models

3 Problems and Tasks

Transformer-based language models are leading a significant advance in AI-based NLP research to bring in better support for human decision-making processes in the legal domain. In this section, we present the main types of legal problems that are recognized as particularly benefiting from AI-based NLP research, and we discuss the associated tasks that are being powered by BERT and related models. We organize our discussion of the legal problems into three broad areas, namely search (Section 3.1), review (Section 3.2), and prediction (Section 3.3). Throughout our discussion, we attempt to organize the flow of presentation by distinguishing tasks involving codes (e.g., statutes, regulations, contracts) from those concerning case law; however, it is often the case that a task can be regarded as relevant for any type of legal document. Note that the three macro categories are actually interleaved and interrelated in many practical scenarios; therefore, our classification should be taken with a grain of salt, serving mainly the sake of presentation. Throughout the remainder of the paper, we will use abbreviations whose descriptions are reported in Table 2.

Table 2. Abbreviations and descriptions of most relevant tasks in this article.

| Abbreviation | Description |
|---|---|
| AS/ES | abstractive/extractive summarization |
| AVP/ALVP | article/alleged violation prediction |
| CIR | case importance regression |
| CJP(E) | court judgment prediction (and explanation) |
| CLM | causal language modeling |
| CTR | case term recognition |
| DR | document retrieval/recommendation |
| DS/SS | document/sentence similarity |
| IE | information extraction |
| IR | information retrieval |
| LPP | legal precedent prediction/retrieval |
| LJP | legal judgment prediction |
| MLM | masked language modeling |
| NER | named entity recognition |
| NLI | natural language inference |
| NSP | next sentence prediction |
| OR | overruling |
| PR/CR | passage/case retrieval |
| QA; MCQA | question answering; multiple choice QA |
| RC; MCRC | reading comprehension; multiple choice RC |
| RIR | regulatory information retrieval |
| RRL | rhetorical role labeling |
| SA | sentiment analysis |
| SAR | statutory article retrieval |
| SF | slot filling |
| STP | same topic prediction |
| TC/SC; TpC | text/sentence classification; topic classification |
| TM/CM | text/case matching |

3.1 Legal Search

Legal search corresponds to a need for legal information, and hence requires the detection and retrieval of documents potentially relevant to support legal decision-making. For instance, lawyers may search for laws enacted by parliaments or civil codes (similar to legislation in a civil law jurisdiction), but also for documents in litigation, patents, and several other documents that can support a law firm [Locke and Zuccon, 2022].

The searched documents are also called legal authorities in [Dadgostari et al., 2021], which points out how the legal search is driven by a notion of relevance, which should be «determined functionally with respect to norms and practices concerning legal reasoning and argumentation within a legal community». Thus, a document is regarded as «legally relevant exactly when it is understood by the dominant legal community as containing information that bears on a legal question of concern» [Dadgostari et al., 2021].

Legal search has been addressed in [Dadgostari et al., 2021] as a citation recommendation problem: given a citation-free legal text (CFLT), to find the most suitable set of opinions, from a reference legal corpus, to be cited for the input CFLT. Then, if the CFLTs are opinions from the corpus where all citation information is deleted, the search results can be compared to the actual citation information. More generally, legal search tasks are mainly addressed from two perspectives, namely Information Retrieval and Textual Entailment. While the former is intuitively seen as an essential part of any legal search task, the latter actually corresponds to Natural Language Inference, since it aims to determine, given any two textual fragments (e.g., two sentences), whether one can be inferred from the other; the entailment is said to be “positive” (resp. “negative”) when the first text can be used to prove that the second text is true (resp. false); otherwise (i.e., if the two texts have no correlation), the entailment is regarded as “neutral” [Kim et al., 2021].

Since 2014, the Competition on Legal Information Extraction/Entailment (COLIEE) has served as an international forum to discuss issues related to legal information retrieval and entailment.38 The COLIEE editions from 2014 to 2017 focus on a two-phase legal question answering task: given a legal bar exam question q, the first phase is to retrieve a set of articles from a target civil code corpus (i.e., the Japanese Civil Code) that are deemed appropriate for answering q, whereas the second phase is to determine whether the (gold) relevant articles entail q or not q. Since the 2018 edition, both the retrieval and entailment tasks are also applied to case law texts, which are relatively long documents consisting of the facts (i.e., factual statements) in a case. Searching for case law is a peculiarity of common-law jurisdictions, which derives from the principle of “stare decisis” (doctrine of precedent), and has unique challenges that have emerged in law research [Locke and Zuccon, 2022]; conversely, in civil-law jurisdictions, statutes are applied in the decision-making for a given legal issue in a mutatis mutandis fashion, i.e., when asserting the substantial identity of two facts, circumstances of a contingent nature, which naturally differ, are ignored. The most recent edition at the time of writing of this article, COLIEE-2021 [Rabelo et al., 2022], proposes five tasks:

- Legal Case Retrieval (Task 1) – the goal is to identify the cases from a court case corpus that support the decision of a query case; such cases are also called “noticed” with respect to the query case, i.e., precedent cases that are referenced by the query case. Formally, given a set of candidate cases C = {c1, . . . , cn} and a query case q, the task is to identify the supporting cases Cq = {c | c ∈ C ∧ noticed(q, c)}, where noticed(q, c) denotes that c should be noticed given the query case q.

- Legal Case Entailment (Task 2) – given a query case, the goal is to identify one or more paragraphs from a case relevant to the query that entail(s) the decision of the query. Formally, given a query case q and a case c_i relevant for q, represented by its paragraphs {c_i,1, . . . , c_i,n_i}, the task is to identify the set of paragraphs {c_i,j | c_i,j ∈ c_i ∧ entails(c_i,j, q)}, where entails(c_i,j, q) is true if the paragraph c_i,j entails q.

- Statute Law Retrieval (Task 3) – this is the former phase-1 in COLIEE-2014, i.e., given a civil code S and a legal bar exam question q, to retrieve the set of articles Sq from S such that entails(Sq, q) or entails(Sq, not q).

- Statute Law Entailment (Task 4) – this is the former phase-2 in COLIEE-2014, i.e., given a legal bar exam question q and relevant articles Sq, to determine if it holds that entails(Sq, q) or entails(Sq, not q).

- Legal Question Answering (Task 5) – this is regarded as a combination of Task 3 and 4 (although, in the COLIEE competition, any knowledge source other than the results of Task 3 can be used).

Training data are pairs ⟨query, noticed case(s)⟩ for Task 1, triplets ⟨query, noticed case(s), entailing paragraph IDs of the case(s)⟩ for Task 2, pairs ⟨query, relevant article(s)⟩ for Task 3, and triplets ⟨query, relevant article(s), Y/N answer⟩ for Task 4 and Task 5; the test data are only queries for Tasks 1, 3, and 5, whereas they include queries and relevant texts for Tasks 2 and 4.
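
To make the shape of these tasks concrete, the following minimal Python sketch mirrors the set-based definitions above. The noticed and entails predicates are placeholders for whatever retrieval or entailment model one might plug in (e.g., a BM25 ranker combined with a BERT-based re-ranker); they are not part of the COLIEE specification.

```python
# Minimal sketch of the COLIEE Task 1 and Task 2 formulations given above.
# `noticed` and `entails` are placeholder predicates standing in for a trained
# retrieval/entailment model; they are assumptions, not competition components.
from typing import Callable, Iterable

def retrieve_noticed_cases(query_case: str,
                           candidates: Iterable[str],
                           noticed: Callable[[str, str], bool]) -> list[str]:
    """Task 1: return C_q = {c in C : noticed(q, c)} for a query case q."""
    return [c for c in candidates if noticed(query_case, c)]

def retrieve_entailing_paragraphs(query_case: str,
                                  paragraphs: Iterable[str],
                                  entails: Callable[[str, str], bool]) -> list[str]:
    """Task 2: keep the paragraphs of a relevant case that entail the query's decision."""
    return [p for p in paragraphs if entails(p, query_case)]
```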

It is worth noticing that supporting cases are relevant factors in court decision-making and are actually used in attorneys' litigation [Nguyen et al., 2021a]. Legal case entailment is also useful in practice, since a decision for a new case can be predicted by implication of previous cases; it can also be treated in combination with case retrieval, as developed in [Vuong et al., 2023], where a supporting model is introduced to describe the case–case supporting relations and to define paragraph–paragraph and decision–paragraph matching strategies. Analogous considerations hold for the statute law tasks. Moreover, the latter are particularly challenging since, besides the need to handle long articles, legal bar exam questions describe specific legal cases, while the language used in statute law tends to be more general.

A further perspective on legal question answering is taken in [Zheng et al., 2021], where a multiple choice question answering task, dubbed CaseHOLD, is defined from legal citations in judicial rulings. The citing context from the judicial decision serves as the prompt for the question, whereas the answer choices are holding statements derived from citations following text in a legal decision. Holdings are central to the common law system, as they represent the predominating, precedential legal rule when the law is applied to a particular set of facts. Analogously, in [Xiao et al., 2021], legal question answering is addressed on the JEC-QA dataset, which consists of multiple-choice questions from the Chinese national bar exam, where the questions and candidate choices are concatenated together to form the inputs of the models.

3.2 Legal Document Review

Document review is another critical process for law practitioners and lawyers, as it usually involves document sets that are unmanageable for a team of humans, given their volume, the cost of reviewers, and the deadlines of legal proceedings. The purpose of legal document review is for the parties to a case to organize and analyze the available documents so as to determine which are sensitive or otherwise relevant to the litigation. For instance, document review can be intended to negotiate or revise an agreement, ensure that the filings of an attorney's client comply with appropriate regulations, modify a brief for a trial motion, inspect a contract to avoid potential risks, or review client tax documents. Relevance, responsiveness to a discovery request, privilege, and confidentiality are essential criteria for any document in the review, as well as in the analysis of the information to relate key documents to alleged facts or key legal issues in the case.

[Shaghaghian et al., 2020] recognizes four main tasks of document review, namely information, fact, comparative, and rule navigation, which are primarily characterized in terms of the following problems:

- Passage retrieval – Navigating a user to answers for non-factoid questions is in fact seen as equivalent to retrieving relevant text passages during the document review process. Passage retrieval to answer non-factoid questions can be modeled as a binary text classification task, i.e., given a set of queries {q_i}_{i=1..Q} and a set of candidate texts (e.g., sentences, snippets, paragraphs) {s_j}_{j=1..N}, each question-snippet pair (q_i, s_j) is assigned label 1 if s_j contains the answer to q_i, and label 0 otherwise (see the sketch after this list).

- Named entity recognition – Examining factoid questions, whereby the user is searching for specific facts or entities, is instead modeled as named entity recognition (e.g., extraction of facts from a court decision document, such as Date of Argument, Date of Decision, Petitioner, Judge, Sought Damages and Damages Awarded Monetary Values). Named entity recognition to extract facts or elements of factoid questions can be modeled as a sequence labeling, multi-class classification task, i.e., given a set of fact-related classes {c_i}_{i=1..C}, each token is assigned a class (or a distribution over the classes).

- Text similarity – Computing text similarity is essential to identify matching texts according to different aspects; for instance, to identify the differences between a regulation and its amended version, or to discover the discrepancies of regulations in different jurisdictions. Text similarity to identify matching texts at various, pre-determined levels can be modeled as a binary, resp. multi-class, text classification task, i.e., given a set of matching levels {m_i}_{i=1..M} and a set of texts {s_j}_{j=1..N}, each pair of texts (s_j, s_k) is assigned a class m_i depending on the degree of matching between s_j and s_k.

- Sentiment analysis – This can be addressed to identify the polarity, or mood, associated with certain legal statements, with the purpose of, e.g., identifying rules imposed by deontic modalities, which take the form of obligation, prohibition and permission statements. This can be modeled as a binary, resp. multi-class, text classification task, i.e., given a set of texts {s_j}_{j=1..N}, each text is assigned a class depending on the polarity or sentiment expressed in the text.
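
As an illustration of the binary text classification framing used for passage retrieval above, the following sketch scores question-passage pairs with a Transformer encoder. The checkpoint name is only an example (a LEGAL-BERT variant available on the Hugging Face hub), and its classification head would still need to be fine-tuned on labeled (query, snippet) pairs before the scores are meaningful.

```python
# Sketch: passage retrieval framed as binary classification of (question, passage)
# pairs with a BERT-style encoder. The checkpoint is an example; the sequence
# classification head is newly initialized and must be fine-tuned before use.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "nlpaueb/legal-bert-base-uncased"  # example legal-domain checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def score_passages(question: str, passages: list[str]) -> list[float]:
    """Return P(label=1 | question, passage) for each candidate passage."""
    inputs = tokenizer([question] * len(passages), passages,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[:, 1].tolist()

# Passages scoring above a threshold (or the top-k) would be returned as answers.
```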

In [Xiao et al., 2021], legal reading comprehension is addressed by predicting the start and end positions of answers, given question-answer pairs with corresponding supporting sentences. Legal document review is also related to document recommendation. As discussed in [Ostendorff et al., 2021], a typical recommendation scenario occurs during the preparation of a litigation strategy, when the involved legal professionals are provided with recommendations of other decisions that possibly cover the same topic or provide essential background information (e.g., they overrule the target decision). Also, text segmentation, i.e., the task of dividing a document into multi-paragraph discourse units that are topically coherent, can be useful for one or more of the above tasks, especially when the existing logical boundaries imposed on the document might not be sufficient to detect fine-grained topic changes. In [Aumiller et al., 2021], text segmentation is used to solve a topical change detection problem (also called same topic prediction): given two chunks of text of the same type (e.g., paragraphs, sections) and binary labels, determine whether the two chunks belong to the same topic; otherwise, a change in topic is detected, marking the beginning of a new chunk of text. Also, [Savelka et al., 2021] introduce the task of automatic functional segmentation, which is to segment adjudicatory decisions of cases according to the functional role of their parts.

Contracts, in various forms, are a major target of interest for document review tasks. [Zheng et al., 2021] focus on contract documents such as Terms of Service, for the detection of potentially unfair contractual terms. A contractual term (clause) is regarded as unfair if it has not been individually negotiated and it corresponds to an evident imbalance in the rights and obligations of the parties, to the detriment of the consumer [Zheng et al., 2021]. A binary classification task can hence be defined, whereby positive examples are the potentially unfair contractual terms. The Terms-of-Service task can help consumers better understand the terms they agree to when signing a contract and ease access to legal advice about unfair contracts. [Hendrycks et al., 2021] address the legal contract review task, which is to analyze a contract to understand the rights and obligations of the signatories as well as to evaluate the associated impact. This task can be seen as similar to extractive question answering, where each question is the description of a label category and language models have to detect the spans of the contract that are related to the label. [Leivaditi et al., 2020] specialize the legal contract review task to lease agreements and address it from two perspectives: detection of sentences expressing a potential risk to one or more signatories (binary classification) and extraction of important entities for the domain (entity recognition). Unlike [Hendrycks et al., 2021] and [Leivaditi et al., 2020], which aim to find what kinds of terms are present, [Koreeda and Manning, 2021] focus on knowing what exactly each of these terms states. Given a set of hypotheses and a contract, the task is to decide whether the contract entails, contradicts, or is neutral to each hypothesis (three-class classification) and to detect the evidence, i.e., spans in the contract, that determines the decision (multi-label binary classification). On privacy policies, [Ahmad et al., 2021] define the intent classification task, which is to predict sentences explaining privacy practices, along with a slot filling task to detect text spans within a sentence expressing specific details. A slot extraction task is also performed in [Bui et al., 2021] to detect spans in the text expressing different types of user data.
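
The following sketch illustrates the extractive-QA framing of contract review described above, where a label category phrased as a question is answered with a span of the contract. The checkpoint and the example question are placeholders (a model fine-tuned on contract data would be used in practice), and long contracts would in practice require windowing over the input.

```python
# Sketch: contract review as extractive question answering (CUAD-style framing).
# The checkpoint is a placeholder; its QA head is untrained for contracts and
# would need fine-tuning. Long contracts require sliding-window chunking.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

MODEL_NAME = "bert-base-uncased"  # placeholder, not a contract-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)

def extract_clause(question: str, contract_text: str) -> str:
    """Return the contract span predicted to answer the category question."""
    inputs = tokenizer(question, contract_text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    tokens = inputs["input_ids"][0][start:end + 1]
    return tokenizer.decode(tokens, skip_special_tokens=True)

# Hypothetical category question, e.g.:
# extract_clause("Highlight the governing law clause.", contract_text)
```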

Another legal context that can be included, in a broader sense, in the document review category concerns a special case of retrieval, namely regulatory information retrieval [Chalkidis et al., 2021b], whose aim is to support a regulatory compliance regime for an organization's processes/controls. A compliance regime includes corrective, detective and preventive measures: given a control/process, relevant laws are retrieved in order to apply corrective measures; or, given a new law, all the affected controls/processes are retrieved in order to apply corrective or preventive measures. Regulatory information retrieval is defined as a special case of document-to-document information retrieval, since the query is an entire document, unlike traditional information retrieval, where queries are usually short texts.

More tasks concern case law documents. Legal cases are lengthy and unstructured, although they are actually characterized by an implicit thematic structure into sections such as “facts of the case”, “arguments given by the parties”, etc. These sections are often called rhetorical roles. Identifying such semantic roles is essential for improving the readability of the documents, but it also helps in downstream tasks such as classification and summarization. The task is challenging since legal documents can vary considerably in structure and rhetorical labels can be subjective. [Bhattacharya et al., 2019b] introduce the rhetorical role labeling task, which is to label the sentences of a legal case with the corresponding rhetorical role. This task was also introduced in the context of the Artificial Intelligence for Legal Assistance (AILA) 2020 competition (Task 2), whereby the predefined labels are “Facts”, “Ruling by Lower Court”, “Argument”, “Statute cited”, “Precedent cited”, “Ratio of the decision”, and “Ruling by Present Court”.39 To support legal document review, special cases of retrieval are also involved. For instance, [Martino et al., 2022] deal with the identification of paragraph regularities in legal cases, which is addressed by using a nearest-neighbor search method to efficiently select the most similar paragraphs appearing in a set of reference documents. Explanatory sentence retrieval [Savelka and Ashley, 2021] is instead to retrieve useful sentences to explain predetermined legal concepts. Explanations of legal concepts can be inferred by looking at how they have been applied in previous cases, allowing a lawyer to elaborate supporting or contrary arguments related to particular accounts of meaning. Searching through legal documents, a lawyer can find sentences mentioning a particular concept, but not all of them may be useful for explaining that concept. Therefore, the aim is to automatically rank sentences so as to assign higher scores to explanatory sentences.

It is also highly desirable for legal professionals dealing with cases to have access to their summaries, also known as headnotes. However, creating headnotes is certainly time-consuming; therefore, automatic summarization of legal judgments is another meaningful problem in the legal domain. Two related tasks have been introduced in the Artificial Intelligence for Legal Assistance (AILA) 2021 competition, namely to identify “summary-worthy” sentences in a court judgment (Task 2a) and to generate a summary from a court judgment (Task 2b).40 The former can be seen as a sentence classification task, whereas the latter can be addressed either by collecting the detected summary-worthy sentences so as to form extractive summaries or by using generative models to produce abstractive summaries.

39 https://sites.google.com/view/aila-2020/task-2-rhetorical-role-labeling-for-legal-judgements

40 https://sites.google.com/view/aila-2021/task-2-summarization-of-legal-judgements

3.3 Legal Outcome Prediction

Legal relevance is related to the well-known predictive theory of the law first introduced in [Oliver Wendell Holmes, 1897]. In contrast to previous definitions of the law, Holmes formulated the law as a prediction, specifically of the behavior of a court, so as to build a more useful approach in practice when dealing with those individuals who care little for ethics or lofty conceptions of natural law (i.e., the “bad men”). Besides Holmes' theory, predictive tasks in law are more generally concerned with judicial opinions. For instance, as discussed in [Dadgostari et al., 2021], given the content of a source judicial opinion, one task is to predict the other opinions that are cited in the source document; or, given a source document and a set of related opinions identified by law professionals, to predict their answers.

The primary predictive task in law is commonly referred to as legal judgment prediction (LJP), i.e., to predict the outcome of a judicial decision based on the relevant facts and laws [Aletras et al., 2016, Zhong et al., 2018, Chalkidis et al., 2019a]. For instance, [Aletras et al., 2016] define the problem of case prediction as a binary classification task, which is to predict whether one of a predetermined, small set of articles of the ECtHR Convention has been violated, given the textual description of a case, which includes the facts, the relevant applicable law and the legal arguments.41 In [Xiao et al., 2021], the LJP task is addressed on both criminal and civil cases from the CAIL-Long dataset. Fact descriptions are taken as input, whereas the judgment annotations are extracted via regular expressions; each criminal case is annotated with the charges, the relevant laws, and the term of penalty, and each civil case is annotated with the causes of action and the relevant laws. For criminal cases, the charge prediction and the relevant law prediction are formalized as multi-label classification tasks, whereas the term-of-penalty prediction task is formalized as a regression task. For civil cases, the cause of action prediction is formalized as a single-label classification task, and the relevant law prediction is formalized as a multi-label classification task. In [Dong and Niu, 2021], the three types of prediction are addressed in the context of graph node classification, where a Transformer model is combined with a graph neural network model. [Malik et al., 2021] propose the court judgment prediction and explanation (CJPE) task, which requires predicting the decision of a case and providing explanations for the final decision, where explanations correspond to portions of the case description that best justify the outcome.

A related axis of prediction is the one introduced in [Mahari, 2021], dubbed legal precedent prediction, which is to predict passages of precedential court decisions that are relevant to a given legal argument posed in the context of a judicial opinion or a legal brief. Both judicial opinions and legal briefs usually contain a number of independent legal arguments, each citing its own set of precedents, where the precedent depends on the context of the entire case as well as on the specific legal argument being made [Mahari, 2021]. Clearly, in common law jurisdictions, this is particularly useful as legal professionals build their arguments by drawing on judicial precedent from prior opinions.
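
As a concrete illustration of the CAIL-Long-style formalization described above, the following sketch (not taken from the cited works) defines prediction heads over an encoded fact description, with multi-label classification for charges and relevant laws and regression for the term of penalty; the encoder producing the fact encoding and the head sizes are assumptions.

```python
# Sketch of criminal-case LJP heads: multi-label charge and law prediction plus
# term-of-penalty regression over an encoded fact description (encoder omitted).
import torch
import torch.nn as nn

class CriminalJudgmentHeads(nn.Module):
    def __init__(self, hidden_size: int, num_charges: int, num_laws: int):
        super().__init__()
        self.charge_head = nn.Linear(hidden_size, num_charges)   # multi-label
        self.law_head = nn.Linear(hidden_size, num_laws)          # multi-label
        self.penalty_head = nn.Linear(hidden_size, 1)             # regression

    def forward(self, fact_encoding: torch.Tensor):
        return (self.charge_head(fact_encoding),
                self.law_head(fact_encoding),
                self.penalty_head(fact_encoding).squeeze(-1))

# Losses matching the task types: binary cross-entropy with logits for the
# multi-label heads, mean squared error for the penalty regression.
charge_loss = nn.BCEWithLogitsLoss()
law_loss = nn.BCEWithLogitsLoss()
penalty_loss = nn.MSELoss()
```
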
Another critical task is overruling prediction, i.e., determining whether a statement is an overruling, that is, a sentence that nullifies a previous case decision as a precedent, by a constitutionally valid statute or by a decision of the same or a higher-ranking court (which establishes a different rule on the point of law involved). In [Zheng et al., 2021, Limsopatham, 2021], overruling prediction is modeled as a binary classification task, where positive examples are overruling sentences and negative examples are non-overruling sentences from the law. The overruling task is clearly important for legal professionals, since verifying that cases remain valid and have not been overruled is essential to ensuring the validity of legal arguments. Case importance and article violation are also considered [Chalkidis et al., 2019a], [Limsopatham, 2021]. Predicting the importance of a case can be seen as a regression task, e.g., measured on a scale ranging from lower scores for key cases to higher scores for unimportant cases. Given the facts of a case, article violation prediction is to predict whether any human rights article or protocol has been violated by the case (binary classification), or which human rights articles and/or protocols (if any) have been violated by the case (multi-label classification). A special case of the above task is the alleged violation prediction introduced in [Chalkidis et al., 2021c], whose aim is to predict the allegations made by applicants given the facts of each case. This can be useful to identify alleged violations for plaintiffs and facts supporting alleged violations for judges, but also for legal experts to identify previous cases related to the allegations. The task is treated in [Chalkidis et al., 2021c] as multi-label text classification, since the model might select multiple articles that were allegedly violated (according to the applicants). Employment notice prediction [Lam et al., 2020] is to predict the number of months awarded for reasonable notice in employment termination cases. If the employer does not comply with the obligation to provide an appropriate employment notice or payment in lieu of notice, judges determine the compensation that the employer owes to the employee at the time of termination. Courts might rely on factors such as length of service, employee's age, character of employment, and aggravated damages to establish what constitutes reasonable notice, but it is not clear how each individual factor should be weighed and used. As a result, the case law on employment notice turns out to be inherently inconsistent and subjective. [Lam et al., 2020] define this problem as a text classification task, in order to mimic the decision-making process of a judge, who would allegedly rely on past cases and differences of fact to decide the amount of reasonable notice.

41 The above view has been recognized not only as one of the most challenging by the legal community, but it has also raised controversial debate on the role of AI applied to law. The adversarial opinion is in fact based on the evidence that, in real-life scenarios, judges are unlikely to defer to AI to decide the outcome of a case. Nonetheless, the authors adopt an opinion that is commonly shared with most researchers and practitioners of AI in law, whereby it should be seen as a powerful tool to aid legal professionals, increase access to justice, and ultimately address unmet needs of the legal community.

3.4 Benchmarks and Datasets

To complement our discussion so far, here we provide a summary of the main benchmarks and datasets that have been recognized as relevant in the TLM-based legal learning context. Our main focus is on those corpora that were used by the approaches covered in this work, which will be described next (Section 4). Note that we shall leave out of consideration the datasets used in the COLIEE Competitions, since they have already been described in Section 3.1. Our presentation is organized into three subsections, which describe corpora concerning caselaw documents, codes, and a combination of both, respectively; moreover, each subsection is further organized by possibly grouping corpora that are cohesive in terms of data type and task. Table 3 summarizes the datasets that we shall describe through this section, according to the legal document category, the data type, the source, the size, and the tasks for which the benchmarks were designed.

3.4.1 Caselaw data

[Strickson and Iglesia, 2020] propose a corpus of about 5K labeled UK court judgments, gathered from the web, for the LJP task. Each law case is divided into separate judgments issued by individual judges, and each sentence in a judgment is labeled as “allow” or “dismiss” through a pattern matching approach. The dataset is used for the LJP task as a classification problem, whereby classic machine learning classifiers (e.g., support vector machine, random forest, logistic regression) are evaluated.

ECHR [Chalkidis et al., 2019a] contains allegations of violated provisions of the European Convention on Human Rights.42 Each case includes a list of facts and a score, provided by the Convention, representing the importance of the case in the development of case law. Also, each case is mapped to the violated articles of the Convention. Moreover, [Quemy and Wrembel, 2022] present ECHR-OD, a new database for storing and managing ECHR cases. It is designed to be automatically maintained and used as a unified benchmark to compare machine learning methods for the legal domain. The authors have provided the whole pipeline for the benchmark data extraction, transformation, integration, and loading as open-source software.

Swiss-Judgment-Prediction (SJP) [Niklaus et al., 2021] comprises 85K cases, in diachronic order, from the Federal Supreme Court of Switzerland (FSCS).43 The evaluation task is a binary classification of the judgment outcome (i.e., approval or dismissal). The dataset includes cases written in German, French and Italian, and is annotated with publication years, legal areas and cantons of origin. [Niklaus et al., 2021] evaluate XLNet, RoBERTa, AlBERT (cf. Section 2), GermanBERT, UmBERTo, CamemBERT (cf. Section 4.6), and two variants namely Hierarchical BERT and Long BERT, both in monolingual or multilingual versions (cf. Section 4.7).

GerDaLIR [Wrzalik and Krechel, 2021] is a dataset for legal precedent retrieval in the German language.44 It is based on case law gathered from Open Legal Data [Ostendorff et al., 2020]. Passages containing references are considered queries, while the referenced law cases are labeled as relevant. The authors evaluate a set of retrieval methods on this dataset with Transformer-based re-ranking. In particular, they fine-tune GBERT and GELECTRA base versions using top-100 BM25 passage rankings and test the final models on top-1000 BM25 passage rankings; the use of ELECTRA for re-ranking has been shown to lead to higher performance in most cases. [Urchs et al., 2021] introduce two further legal corpora for German law. The first corpus45 contains about 32K decisions, enriched with metadata, from hundreds of Bavarian courts. There are 22 different types of decisions in the corpus, such as resolutions, judgments and end-judgments. This corpus is not intended for a specific task (for example, it can be used to detect the type of the decision). The second corpus46 is a subset of the former and contains 200 judgments, whose sentences (about 25K) were annotated by a domain expert w.r.t. four components of the text (written in the Urteilsstil style): “conclusion” (i.e., the overall result of the case), “definition” (i.e., abstract legal facts and consequences), “subsumption” (i.e., the ensemble of concrete facts and determination sentence), and “other” (i.e., sentences not labeled with any of the three previous labels). This corpus is intended for the automatic detection of conclusion, definition and subsumption components.

42 https://archive.org/details/ECHR-ACL2019

43 https://huggingface.co/datasets/rcds/swiss_judgment_prediction

44 https://github.com/lavis-nlp/GerDaLIR

45 https://zenodo.org/record/3936726#.ZAdMIXbMJD_

46 https://zenodo.org/record/3936490#.ZAdN7HbMJD_

[Zhong et al., 2019b] provide 92 expert-annotated extractive summaries of Board of Veterans’ Appeals (BVA) cases focused on post-traumatic stress disorder (PTSD), along with 20 test cases quadruple-annotated for agreement evaluation and two expert-written summaries for each test case.47 Each sentence is annotated with one of six labels, namely issue, procedural history, service history, outcome, reasoning, and evidential support. Also, [Walker et al., 2019] introduce a dataset to test the performance of rule-based script classifiers, comprising 50 fact-finding decisions of BVA cases focused on veterans’ appeals against a rejected disability claim for service-related PTSD. Each sentence of the dataset is assigned a rhetorical role by domain experts as follows: finding sentence, if it states a finding of fact; evidence sentence, if it states the content of a testimony; reasoning sentence, if it reports the reasoning of the judge underlying the findings of fact; legal-rule sentence, if it states legal rules in the abstract; and citation sentence, if it refers to legal authorities and other materials. The dataset is used to test two hypotheses: whether distinctive phrasing allows automatic classifiers to be developed on a small set of labeled decisions, and whether semantic attribution theory can provide a general approach to develop such classifiers. Results demonstrate that some use cases can be addressed using a very small set of labeled data.

Multi-LexSum48 is a collection of almost 9K expert-edited abstractive summaries for 40K writings of the Civil Rights Litigation Clearinghouse (CRLC),49 which provides information on federal US civil rights cases for various target audiences (lawyers, scholars, and the general public) [Shen et al., 2022]. It is designed for multi-document and single-document summarization tasks. The source documents are extremely long, with cases often spanning more than two hundred pages. Multi-LexSum provides multiple summaries of different granularity (from extreme one-sentence summaries to summaries of more than five hundred words). Although the provided summaries are abstractive, they share a high fraction of terms with the source documents.

RulingBR [Feijó and Moreira, 2018]50 comprises 10K Brazilian rulings for legal summarization tasks, retrieved from the decision documents of the highest court in Brazil, the Supremo Tribunal Federal (STF).51 Each decision document is composed of the following four parts: “Ementa” (i.e., summary), “Acórdão” (i.e., judgment), “Relatório” (i.e., report), and “Voto” (i.e., vote). The Ementa part is used as the gold summary for the dataset. [Lage-Freitas et al., 2022] also propose a dataset consisting of about 4K legal cases from a Brazilian state higher court (Tribunal de Justiça de Alagoas), with a focus on the Brazilian appeals system, assigning the appeals with labels regarding court decisions. Following [Aletras et al., 2016], the authors assume that there is enough similarity between the case descriptions of legal judgments and the appeals lodged by attorneys. Brazilian court data are scraped from the Web and segmented into sections, identifying the description, decision and unanimity parts; then, description sentences are labeled according to the decision outcome (yes, no, or partial) and unanimity information (unanimity vs. non-unanimity).

CAIL2019-SCM [Xiao et al., 2019] is a dataset of about 8K triplets of cases of the Supreme People's Court of China, concerning private lending.52 It was collected from China Judgments Online53 for the CAIL competition, where participants were required to perform a similar case matching task, i.e., to detect which pair of cases in the triplet contains the most similar cases. Every document in the triplet refers to the fact description of a case. The most similar pair within each triplet is identified by legal experts. The authors provide some baselines to compare the participants' performance, one of which uses BERT to obtain embeddings of the two cases, from which the similarity score is computed. The CAIL competition was first held in 2018 [Xiao et al., 2018]. In CAIL2018, participants were required to perform a legal judgment prediction task divided into three sub-tasks: law article prediction, charge prediction, and term-of-penalty prediction. The input is the fact description of a criminal case, and the associated dataset is divided into two sub-datasets: CAIL-big (with more than 1.6M cases) and CAIL-small (about 130K cases). [Yu et al., 2022b] extended the fact prediction task data of CAIL 2021 for the explainable legal case matching task. The sentences of a legal case in CAIL 2021 are associated with several tags regarding the issue of private lending. In the proposed dataset, called eCAIL,54 the tagged sentences are considered as rationales. Given two legal cases, cross-case sentences with identical labels are pro-rationales for the matching task, while sentences with different labels are con-rationales. A matching label is assigned to each case pair according to the tag overlap: if more than 10 tags overlap, the cases are considered matching; if there is no overlap, the label is mismatching; and an overlap of 10 or fewer tags is considered partially matching. The dataset provides 6K legal case pairs, with rationales and explanations (the concatenation of all the overlapping tags) for the matching labels. [Yu et al., 2022b] also provide ELAM, a dataset for the explainable legal case matching task, containing 5K legal case pairs with the associated matching labels, rationales, their alignments and the explanations for the matching decision.
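
The tag-overlap rule used in eCAIL to derive matching labels can be summarized by the following small sketch, a direct transcription of the thresholds described above (the label strings are only illustrative renderings of the three classes):

```python
# Sketch of the eCAIL tag-overlap rule: >10 shared tags -> matching,
# no shared tags -> mismatching, otherwise partially matching.
def matching_label(tags_a: set[str], tags_b: set[str]) -> str:
    overlap = len(tags_a & tags_b)
    if overlap > 10:
        return "matching"
    if overlap == 0:
        return "mismatching"
    return "partially matching"
```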

47 https://github.com/luimagroup/bva-summarization

48 https://multilexsum.github.io/

49 https://clearinghouse.net/

50 https://github.com/diego-feijo/rulingbr

51 https://portal.stf.jus.br/

52 https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm

53 https://wenshu.court.gov.cn/

54 https://github.com/ruc-wjyu/IOT-Match

The authors collected the legal cases online,55 which refer to the crime of obstruction of the social management order. Each case is associated with several legal-related tags. To pair the legal cases, the authors randomly selected 1250 query cases and constructed a pool of candidates for each query. From the candidate pool, a case is retrieved based on the number of overlapping tags between the case and the query. Each sentence of a legal case pair is associated with a rationale label, with the support of legal experts. The possible rationale labels are the following: not a rationale, a key circumstance, a constitutive element of a crime, or a focus of disputes. The alignment of the rationales (i.e., pro and con rationales) and the matching label (matching, partially matching or not matching) are then marked. Legal experts are also asked to provide explanations for their matching decision.

ILDC (Indian Legal Documents Corpus) [Malik et al., 2021] comprises about 35K cases from the Indian Supreme Court, annotated with the court decisions.56 It is a corpus intended for court judgment prediction and explanation, which requires a model to predict the final outcome (accept or reject, w.r.t. the appellant) and to provide explanations for the given prediction. In this regard, a portion of the corpus is annotated with explanations given by legal experts, ranked in order of importance in such a way that a higher rank corresponds to an explanation that is more important for the final judgment. The dataset is divided into ILDCsingle and ILDCmulti, depending on whether there is a single decision for documents having one or more petitions, or different decisions for documents with multiple appeals.

[Kalamkar et al., 2022] propose a corpus of 354 Indian legal judgment documents, annotated via a crowd-sourcing activity with 12 different rhetorical roles, from different courts (Supreme Court of India, High Courts and district-level courts).57 The annotation process is designed with the support of legal experts. The corpus is intended for the automatic structuring of legal documents. A Transformer-based model is proposed as a baseline for the benchmark. Moreover, the authors propose extractive/abstractive summarization and court judgment prediction as two applications of rhetorical roles, testing how rhetorical roles can be useful for those tasks. For extractive and abstractive summarization, they experiment with the LawBriefs corpus, which comprises 285 expert-authored extractive summaries of Indian court judgments. For the court judgment prediction task, experiments were conducted using the ILDC corpus [Malik et al., 2021].

Two further legal datasets for rhetorical role identification are introduced in [Bhattacharya et al., 2021]. One dataset contains 50 cases from the Supreme Court of India belonging to five law domains: criminal, land and property, constitutional, labour/industrial and intellectual property rights. Such documents are gathered from the Thomson Reuters Westlaw India website.58 The other dataset contains 50 cases from the UK Supreme Court, gathered from the official website of the court.59 Both datasets are labeled with the following seven rhetorical roles: “Facts”, “Ruling by Lower Court”, “Argument”, “Statute”, “Precedent”, “Ratio of the decision” and “Ruling by Present Court”. [Paul et al., 2022b] introduce a pre-training corpus consisting of about 5.4M Indian court cases. The documents are gathered from several web platforms and come from the Supreme Court and many High Courts of India. The corpus covers various court case domains as well as more than 1K central government acts. The authors further pre-train Legal-BERT@aueb and Legal-BERT@stanford on the proposed corpus and assess its pre-training effectiveness on several downstream benchmarks, for the Indian as well as English languages. The performance of the pre-trained models has been compared to BERT and to the original Legal-BERT@aueb and Legal-BERT@stanford. [Bhattacharya et al., 2019a] gather about 17K legal cases of the Supreme Court of India through the website of Westlaw India, which provides documents and related summaries written by domain experts. The authors perform a systematic comparison of several summarization algorithms, such as traditional unsupervised extractive methods (e.g., latent semantic analysis), neural unsupervised extractive methods (e.g., Restricted Boltzmann Machines [Verma and Nidhi, 2018]), and summarization methods specifically conceived for legal documents, both unsupervised (CaseSummarizer [Polsley et al., 2016]) and supervised (LetSum [Farzindar and Lapalme, 2004]).

[Shukla et al., 2022] provide three legal summarization datasets60 gathering documents from Indian and UK law. The first is the Indian-Abstractive dataset (IN-Abs), with about 7K cases of Indian Supreme Court judgments, obtained from the website of the Legal Information Institute of India,61 and corresponding abstractive summaries. The second is the Indian-Extractive dataset (IN-Ext), with 50 case documents of the Indian Supreme Court labeled with six rhetorical roles (i.e., facts, argument, statute, precedent, ratio of the decision, and ruling by present court) and extractively summarized by domain experts, providing a summary for each rhetorical segment separately (with the exception of the ratio and precedent segments, which are summarized together). The third is the UK-Abstractive dataset (UK-Abs), with 793 case judgments gathered from the website of the UK Supreme Court, which also provides the press (abstractive) summaries of the cases, divided into three segments: “Background to the Appeal”, “Judgment”, and

55 https://www.faxin.cn/

56 https://github.com/Exploration-Lab/CJPE

57 https://legal-nlp-ekstep.github.io/Competitions/Rhetorical-Role/

58 http://www.westlawindia.com

59 https://www.supremecourt.uk/decided-cases/

60 available at https://github.com/Law-AI/summarization

61 http://www.liiofindia.org/in/cases/cen/INSC/

“Reasons for Judgment”. The authors specify three criteria for the evaluation of methods: document-level summaries, segment-wise evaluation (i.e., how well the summary represents the logical rhetorical segments in the legal case), and evaluation of the summaries by domain experts.

[Niklaus et al., 2022] augment the Swiss-Judgment-Prediction (SJP) dataset introduced in [Niklaus et al., 2021] via machine translation, i.e., translating a document written in one of the three languages (German, Italian, French) into the remaining two languages. A second version of the dataset is also provided by further augmenting SJP with Indian cases of the ILDC corpus, provided by [Malik et al., 2021]. In this regard, they translate all the Indian cases reported in the corpus to German, French and Italian. The authors evaluate several TLMs in relation to cross-domain (i.e., different legal areas), cross-regional (i.e., different regions) and cross-jurisdiction (from Indian to Swiss) transfer learning, whose discussion is deferred to Section 4.7.

LEX Rosetta [Savelka et al., 2021] is a multilingual dataset of about 89K annotated sentences for the task of automatic functional segmentation, i.e., segmenting adjudicatory decisions of cases according to the functional role of their parts.62 The sentences come from 807 documents of several courts, gathered from different sources that include seven countries (Canada, Czech Republic, France, Germany, Italy, Poland, USA), and are annotated according to the following types: out of scope (i.e., sentences that are outside the main document, such as editorial content and appendices), heading (i.e., markers of a section), background (i.e., sentences explaining facts, claims and procedural background), analysis (i.e., sentences containing the court reasoning and applications of law to the facts), introductory summary (i.e., a summary of the discussed case), and outcome (i.e., sentences describing the final decision). The dataset is used to test whether GRU-based models generalize across different contexts (countries) in the segmentation of cases into three functional types (Background, Analysis and Outcome). To this end, the authors analyze the use of multilingual sentence embeddings of predictive models in three versions: training the model on a single context and evaluating transfer learning on other unseen contexts; training the model on a set of contexts and evaluating transfer learning on other unseen contexts; and pooling the data of the target context with data from the other contexts. Results have shown that the second and third versions of the model are more effective.

3.4.2 Law code data

SARA [Holzenberger et al., 2020] is a dataset for statutory reasoning on US tax law.63 It comprises a set of rules extracted from the statutes of the US Internal Revenue Code (IRC),64 along with a set of questions that require referring to the rules to be answered correctly. In fact, the IRC contains rules and definitions for the imposition and calculation of taxes, and it is subdivided into sections defining one or more terms (e.g., employment, employer and wages). Each section is normally organized around a general rule, followed by a number of exceptions, and each of its subsections refers to a certain number of slots, which may be filled by existing entities. The IRC can hence be framed as a set of predicates formulated in human language, so that a system is required to determine whether a subsection applies, and to identify and fill the slots mentioned. Statutory reasoning is addressed as an entailment task and a question answering task. In the first task, two paragraphs are manually created for each subsection as test cases: one describes a case to which the statutes apply, the other a case to which the statutes do not apply. In the second task, test cases are created to predict how much tax a person owes, considering all the statutes and applying arithmetic calculations. In general, this dataset offers features that allow for reasoning on several aspects, such as time, numbers, cross-references and common sense. To test the abilities of NLP models on the statutory reasoning problem, the authors pre-trained the models (e.g., Legal-BERT@jhu) on a large legal corpus obtained by extracting tax law documents from the Caselaw Access Project,65 private letter rulings from the Internal Revenue Service (IRS)66 and unpublished US Tax Court cases.
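
To illustrate the SARA framing of a statutory subsection as a slot-bearing predicate, the following sketch shows a toy rule; the slot names and the rule itself are invented for illustration and do not reproduce any actual IRC subsection.

```python
# Sketch: a statutory subsection viewed as a predicate with named slots.
# The rule and slot names are hypothetical, not taken from the IRC.
from typing import Optional

def subsection_applies(employer: Optional[str],
                       employee: Optional[str],
                       wages_paid: Optional[float]) -> bool:
    """Hypothetical rule: the subsection applies when an employer paid
    positive wages to an employee (all slots filled)."""
    return (employer is not None and employee is not None
            and wages_paid is not None and wages_paid > 0)

# The entailment task pairs each subsection with one case to which it applies
# and one to which it does not; a positive test case here might be
# subsection_applies("Acme Corp", "Alice", 1200.0)  # -> True
```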

BSARD [Louis and Spanakis, 2022] is a French dataset composed of more than 1.1K legal questions labeled by domain experts with relevant articles selected from the 22K law articles gathered from 32 publicly available Belgian codes.67 The set of questions and associated relevant articles are obtained in collaboration with Droits Quotidiens (DQ), an organization composed of a team of experienced jurists, which every year receives many questions from citizens seeking advice on legal issues, retrieves the articles relevant to the questions asked, answers the questions in a manner comprehensible to the applicant and categorizes the set of questions, legal references and answers with tags. The resulting corpus contains a large number of legal topics (related to social security, work, family, justice and so on). BSARD is intended for statutory article retrieval.

62 https://github.com/lexrosetta/caselaw_functional_segmentation_multilingual

63 https://nlp.jhu.edu/law/

64 https://uscode.house.gov/browse/prelim@title26&edition=prelim

65 https://case.law/

66 https://www.irs.gov/tax-exempt-bonds/teb-private-letter-ruling-some-basic-concepts

67 https://github.com/maastrichtlawtech/bsard

GLC [Papaloukas et al., 2021] is a dataset of about 47K documents regarding Greek legislation, designed for the multi-granular topic classification task, which requires detecting the thematic topic that is representative of a legal document.68 The thematic topics are available in a multi-level hierarchy. The main data source for this dataset is the Permanent Greek Legislation Code - Raptarchis, a catalogue of Greek legislation available through the portal e-Themis.69 The portal provides a thematic index for the catalogue, reflecting the thematic hierarchical categories (topics). The hierarchy is dictated by the structural division into volumes, chapters and subjects, which reflect the levels of thematic topics. The classification task in GLC is divided into three sub-tasks, each dealing with one level of the hierarchy.

Besides statutes, several benchmarks have been developed for contracts of different types. CUAD [Hendrycks et al., 2021] is a dataset specialized for legal contract review.70 It includes more than 500 contracts, varying in type and length, with 13K annotations across 41 category labels provided by legal experts. Such category labels regard general information, such as party names, dates, renewal terms and so on, as well as restrictive covenants and revenue risks. Language models are required to detect the portions of a contract (the clauses) related to each label. Evaluations of models such as BERT, AlBERT, RoBERTa and DeBERTa have highlighted that performance is influenced by model design and training set size.

[Leivaditi et al., 2020] provide a dataset containing 179 annotated documents regarding lease contracts. The annotations consist of entities (related to parties, property, terms and rent conditions, dates/periods) and red flags, i.e., terms or sentences indicating a potential risk for one or more parties (e.g., break option, guarantee transferable, right of first refusal to lease, bank guarantee), so that the dataset is mainly intended for supporting red flag detection and entity extraction tasks. The documents in the dataset are gathered from the EDGAR database, which is accessible through the US Securities and Exchange Commission (SEC).71 The process of selecting the contracts to be annotated is performed using the BM25 ranking function, which evaluates the relevance of documents w.r.t. keywords/queries that may suggest the presence of red flags. The identification of such keywords/queries and the process of annotation are supervised by domain experts.

ContractNLI [Koreeda and Manning, 2021] contains 607 annotated contracts regarding non-disclosure agreements.72 Such documents are gathered from Internet search engines and EDGAR. By comparing different non-disclosure agreements, a set of 17 hypotheses is obtained. Each document is annotated with respect to its relation to the hypotheses (i.e., entailment, contradiction, or not mentioned). If a document is annotated as entailing or contradicting, the spans (i.e., sentences or list items within a sentence) composing the document are annotated as evidence or not (binary label) of the associated entailment relationship.

ToS [Aumiller et al., 2021] is a dataset consisting of Terms-of-Service documents, specifically collected for the topic similarity task. The documents include heterogeneous topics due to the different web sources. Some of the most frequent topics regard limitation of liability, law and jurisdiction, warranty, and privacy. Topics are obtained in a hierarchical way by splitting the documents into smaller chunks. The authors define and test a system built on TLMs, which proved to largely outperform segmentation baselines based on TF-IDF and bag-of-words.

MAUD [Wang et al., 2023] is a dataset for the legal multiple-choice reading comprehension task and consists of legal texts extracted from 152 public merger agreements gathered from EDGAR. Merger agreements are legal documents regarding public target company acquisitions. In these documents there are special clauses, called deal points, that establish the conditions under which the parties are obliged to complete the acquisition. The deal points are extracted from the merger agreements by lawyers working on the American Bar Association's 2021 Public Target Deal Points Study (“ABA Study”). Moreover, a set of multiple-choice questions is answered by the lawyers for each deal point. One or more questions can be asked for a deal point, and each question can be answered by one or more answers. MAUD contains 92 questions, 8K unique deal point annotations, 39K question-answer annotations (the examples) and 7 deal point categories (e.g., Conditions to Closing, Deal Protection and Related Provisions, Material Adverse Effect).

[Manor and Li, 2019] provide a dataset containing legal contracts and summaries gathered from two websites, TL;DRLegal73 and TOS;DR,74 whose purpose is to clarify the content of contracts through summaries. More precisely, the former, which deals mainly with software licences of companies, is used as a source for collecting 84 sets of contract agreement sections and corresponding summaries, whereas 412 sets are obtained from the TOS;DR website, which focuses on user data and privacy topics of companies. The quality of the proposed summaries is verified by the authors through an analysis of levels of abstraction, compression and readability.

68 https://huggingface.co/datasets/greek_legal_code

69 https://www.secdigital.gov.gr/e-themis/

70 https://github.com/TheAtticusProject/cuad/ The CUAD dataset is also available at atticusprojectai.org/cuad

71 https://www.sec.gov/edgar.shtml

72 https://stanfordnlp.github.io/contract-nli/

73 https://tldrlegal.com/

74 https://tosdr.org/

Another important target of interest for the development of benchmarks is represented by privacy policies. PolicyIE [Ahmad et al., 2021] is a corpus for automating fine-grained information extraction from privacy policies, especially through intent classification and slot filling tasks.75 PolicyIE consists of about 5K intent and 11K slot annotations over several privacy policies of websites and mobile applications. The retrieved policy documents cover four privacy practices that are included in the General Data Protection Regulation (GDPR). Thus, sentences of such policy documents are categorized into the following GDPR-like intent classes: data collection/usage (i.e., what user information is collected, as well as the reason and the modality in which it is collected), data sharing/disclosure (i.e., what user information is shared with third parties, as well as the reason and the modality in which it is shared), data storage/retention (i.e., the location and time period in which user information will be saved), data security/protection (i.e., what protection measures are taken for user information), and other (i.e., privacy practices not included in the other categories). Sentences are annotated with 18 slot labels, which can be categorized into two overlapping types: type-I, which comprises the data and the participants in the policy practices (e.g., data provider, data collected, data collector), and type-II, i.e., purposes, conditions, polarity and protection methods. The annotation procedure was performed and monitored by domain experts.
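
For illustration, a PolicyIE-style sentence can be encoded with a sentence-level intent and token-level BIO slot tags along the following lines (the label names are simplified and do not reproduce the exact tag set).

```python
# A toy intent-classification + slot-filling example in BIO format.
# Intent: the GDPR-like category of the whole sentence.
# Slots:  spans such as the data collector, the data collected, and the purpose.
sentence = ["We", "collect", "your", "email", "address", "to", "send", "newsletters"]
intent = "data-collection-usage"
slots = ["B-data-collector", "O", "O", "B-data-collected", "I-data-collected",
         "O", "B-purpose", "I-purpose"]
assert len(sentence) == len(slots)  # one tag per token, as in standard slot filling
```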

PrivacyQA [Ravichander et al., 2019] contains 1750 questions with over 3.5K annotations of relevant answers regarding privacy policies of mobile applications.76 Questions for a particular privacy policy are provided by crowdworkers, while the identification of the related answers is entrusted to legal experts, who also provide meta-annotations on the relevance of the question, its OPP-115 category, its subjectivity, and the likelihood that the answer to the input question is contained in a privacy policy. The authors test the ability of different baselines on two tasks: deciding whether a question is answerable, and identifying evidence in the policies for a given question.

PolicyQA [Ahmad et al., 2020] comprises about 25K question-passage-answer triplets, derived from segments of website privacy policy documents.77 The corpus is designed so that the answer consists of small portions of text that best identify the target information in relation to the question. It is curated from the existing OPP-115 corpus [Wilson et al., 2016], which consists of 115 website policies (about 3.7K segments) annotated following annotation schemes developed by domain experts. The annotation schemes categorize the policy segments into ten data practice categories (e.g., first party collection/use), which are further divided into several practice attributes (e.g., user type), and each practice attribute is assigned a set of values, for instance, user without account, user with account, other, and unspecified. The annotated segments with the associated practice, attribute and value categories are used to form the PolicyQA corpus. Segments and categories are provided to skilled annotators to manually formulate the questions, for a total of 714 individual questions. The associated QA task is answer span prediction given a policy segment. In this regard, two neural baselines are evaluated, one of which is based on BERT.
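
A minimal sketch of answer span prediction over a policy segment with an extractive QA model is given below; the checkpoint is a generic placeholder, and for PolicyQA one would first fine-tune a model such as BERT on its training split.

```python
from transformers import pipeline

# Extractive QA: the model predicts the start/end of the answer span
# inside the given policy segment.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")  # placeholder checkpoint
segment = ("We collect your email address and device identifiers when you create "
           "an account, and we use this information to personalize the service.")
result = qa(question="What information does the service collect?", context=segment)
print(result["answer"], result["score"])
```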

[Bui et al., 2021] introduce a corpus78 for the extraction and visualization of personal data objects in privacy policies, i.e., spans in the text expressing types of user data, and of the related privacy actions. The proposed corpus contains about 4.1K sentences and 2.6K annotated fine-grained data objects concerning several real-world privacy policies. It is obtained using the OPP-115 dataset as a starting point, opting for the top US websites, which cover several domains such as banking, e-commerce, and social networks. The data objects in the privacy policies are detected by annotators with experience in privacy and security research. The data objects are then labeled by the annotators, choosing among the “collect”, “not collect”, “share” and “not share” labels. Such labels indicate the privacy action performed on the user data (collection or sharing). The resulting annotation has also been revised through a semi-automated process, involving correction and pre-annotation tools, to improve its quality. The final corpus is used to train and evaluate a neural NER model, called PI-Extract, on the extraction of personal data objects and privacy actions. The task is formulated as a sequence labeling problem, i.e., assigning a label to each token of a given sentence.
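
The sequence labeling formulation can be sketched with a standard Transformer token-classification head, as below; this is an illustrative setup with a simplified label set, not PI-Extract's actual architecture.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Simplified BIO label set for privacy actions on data objects.
labels = ["O", "B-COLLECT", "I-COLLECT", "B-SHARE", "I-SHARE",
          "B-NOT_COLLECT", "I-NOT_COLLECT", "B-NOT_SHARE", "I-NOT_SHARE"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))  # would be fine-tuned on the corpus

sentence = "We share your location data with advertising partners."
enc = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                     # (1, seq_len, num_labels)
pred = [labels[i] for i in logits.argmax(-1)[0].tolist()]  # one label per (sub)token
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), pred)))
```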

Relevant benchmarks have been built by considering multilingual and/or multi-task evaluation scenarios. For instance, COVID-19 Exceptional Measures [Tziafas et al., 2021] is a collection of legal, manually annotated documents regarding COVID-19 exceptional measures across 21 European countries, intended for a multilingual classification task. To this end, feature-based methods and XLM-RoBERTa pre-trained on the collection have been evaluated, with the domain-adapted TLMs achieving the best results. MultiEURLEX [Chalkidis et al., 2021a] consists of European Union laws, annotated with multiple labels and translated into 23 languages, with legal topic classification as the supported task.79 The authors experiment with monolingual BERT models, pre-trained on Romance, Slavic, Germanic and Uralic languages, and multilingual models (mT5 and XLM-RoBERTa), which are evaluated for cross-lingual legal text classification on this benchmark. The experimentation focuses mainly on zero-shot cross-lingual transfer, namely the one-to-many setting, in which a multilingual model is fine-tuned on one language and evaluated on the other 22 languages. However, models are also evaluated on

75 https://github.com/wasiahmad/PolicyIE

76 https://github.com/AbhilashaRavichander/PrivacyQA_EMNLP

77 https://github.com/wasiahmad/PolicyQA

78 https://github.com/um-rtcl/piextract_dataset

79 https://huggingface.co/datasets/multi_eurlex

one-to-one (training and testing on the same language) and many-to-many (training and testing on all languages) settings. Adaptation strategies are applied to the multilingual models to avoid catastrophic forgetting of multilingual knowledge when fine-tuning on one source language only, significantly improving zero-shot cross-lingual transfer. In the one-to-one setting, multilingual models prove to be competitive against monolingual models. EUR-Lex-Sum [Aumiller et al., 2022] is a multi- and cross-lingual dataset containing about 32K pairs of documents and summaries in 24 languages.80 Each language comprises up to 1500 pairs. Documents consist of legal acts retrieved from the European Union law website,81 375 of which are available in each of the languages. Summaries are structured following particular guidelines; for example, there are sections dedicated to key points, background, key terms, and so on. The authors evaluate several zero-shot extractive baselines, one of which is a version of LexRank that receives chunks (based on existing separators in the text) and uses embeddings generated by SBERT, as well as cross-lingual baselines, including one based on LED with the capability of greedily chunking the text when document sizes exceed the model’s maximum input length.
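
As a rough sketch of an SBERT-based extractive baseline, chunks can be embedded and the most central ones selected as the summary; the snippet below uses a degree-centrality simplification of LexRank and a placeholder multilingual checkpoint, and is not the authors' exact implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def extractive_summary(chunks, k=2):
    """Embed chunks with SBERT and pick the k most central ones
    (average cosine similarity to all other chunks) as the summary."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder checkpoint
    emb = model.encode(chunks, normalize_embeddings=True)
    sim = emb @ emb.T                     # cosine similarities (embeddings are normalized)
    centrality = sim.mean(axis=1)
    top = np.argsort(-centrality)[:k]
    return [chunks[i] for i in sorted(top)]  # keep the original chunk order

chunks = ["This Regulation lays down rules on ...",
          "Member States shall designate a supervisory authority ...",
          "The Commission shall report on the application of this Regulation ..."]
print(extractive_summary(chunks, k=2))
```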

LegalNERo [Pais et al., 2021] contains 370 legal documents, designed for NER tasks and manually annotated according to five coarse-grained classes: person, location, organization, time expressions, and legal document references.82 The documents are extracted from a larger corpus, called MARCELL-RO,83 containing several documents from the national legislation (e.g., decrees, regulations, laws) of seven countries, Romania included. The authors evaluate a baseline based on BiLSTM and CRF, which takes as input a text representation obtained through FastText.84

PLUE [Chi et al., 2023] is a multi-task benchmark that collects several privacy policy datasets (including the aforementioned ones) to evaluate NLP methods over various privacy policy tasks, namely classification, question answering, intent classification, slot filling, and named entity recognition.85 In particular, PLUE contains the following datasets: OPP-115, APP-350 [Zimmeck et al., 2019], PrivacyQA, PolicyQA, PolicyIE, and PI-Extract (the dataset used in [Bui et al., 2021]). To enable model pre-training on the domain, [Chi et al., 2023] also provide a large corpus built from MAPS [Zimmeck et al., 2019], a corpus of 441K mobile application privacy policies, and the Princeton-Leuven Longitudinal Corpus (PLLC) [Amos et al., 2021], containing about 130K privacy policies of websites. From the combination of the two corpora, a pre-training corpus with 332M words is created. The authors evaluate several TLMs as baselines, first pre-trained on MAPS and PLLC and then fine-tuned on the PLUE datasets.

[Drawzeski et al., 2021] introduce a multilingual corpus for the analysis of the fairness of online terms of service. The dataset contains 100 contracts, derived from 25 ToS documents annotated in four languages (English, German, Italian and Polish) and extracted from an existing corpus [Lippi et al., 2019]. In each contract, potentially unfair clauses are labeled with one of nine possible unfairness categories, namely arbitration, unilateral change, content removal, jurisdiction (i.e., which courts will have jurisdiction over disputes under the contract), choice of law (i.e., which law will regulate the contract), limitation of liability, unilateral termination, contract by using (i.e., using the service binds the consumer to its terms of use without requiring her/him to indicate that they have been read and accepted), and privacy included (i.e., using the service implies acceptance of the related privacy policy). Moreover, for each category the degree of unfairness is indicated with three numerical values: 1 for clear fairness, 2 for potential unfairness, and 3 for clear unfairness. Four types of discrepancies are observed across the language versions of the same contract, relating to sentence structures, errors or inaccuracies in translation into the target languages, the absence of some clauses in certain language versions, and the choice of legal terminology.

3.4.3 Hybrid data

LexGLUE [Chalkidis et al., 2022b] is a collection of seven existing legal NLP datasets86 for evaluating models across several legal tasks, which include multi-label classification, multi-class classification and multiple-choice question answering:

  • ECtHR Tasks A & B for multi-label classification of allegations regarding violations of the European Convention on Human Rights (ECHR) provisions. The dataset is used to test models on article violation prediction (Task A, [Chalkidis et al., 2019a]) and alleged violation prediction (Task B, [Chalkidis et al., 2021c]). In both Task A and Task B, the total number of ECHR articles is reduced to 10, discarding those that are rarely discussed, cannot be violated, or do not depend on the facts of a case.

80 https://github.com/achouhan93/eur-lex-sum

81 https://eur-lex.europa.eu/

82 https://lod-cloud.net/dataset/racai-legalnero

83 https://marcell-project.eu/

84 https://fasttext.cc/

85 https://github.com/JFChi/PLUE

86 https://huggingface.co/datasets/lex_glue

  • The English part of MultiEURLEX [Chalkidis et al., 2021a] for multi-label classification on European Union (EU) legislation. It includes different labeling granularity levels (from 21 to 7K EuroVoc concepts). In LexGLUE, the 100 most frequent labels from the second level of granularity (567 total labels) are considered.

  • SCOTUS87 for multi-class classification on US Supreme Court opinions. In LexGLUE, SCOTUS opinions are associated with 14 issue areas (e.g., Economic Activity, Criminal Procedure, Civil Rights) obtained through the Supreme Court Database.88

  • LEDGAR [Tuggener et al., 2020] for multi-class classification on contract provisions gathered from US Securities and Exchange Commission (SEC) documents. In LexGLUE, a subset of the original dataset is considered, derived from the 100 most frequent labels.

  • UNFAIR-ToS [Lippi et al., 2019] for multi-label classification on contracts between providers and users of services (i.e., terms of service), with 8 classification labels.

  • CaseHOLD [Zheng et al., 2021] for multiple-choice question answering about holdings of US court cases gathered from the Harvard Law Library case law corpus.

[Chalkidis et al., 2022b] evaluate BERT, RoBERTa, DeBERTa, Longformer, BigBird, Legal-BERT@aueb and Legal-BERT@stanford on LexGLUE. For ECtHR and SCOTUS, the authors employ a hierarchical variant of the models, following [Chalkidis et al., 2021c]. The domain-adapted models, i.e., Legal-BERT@aueb and Legal-BERT@stanford, performed overall better than their competitors, with large improvements on US case law data.
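
A minimal sketch of such a hierarchical variant is shown below, assuming PyTorch and Hugging Face Transformers: the document is split into segments, each segment is encoded with BERT, and the segment [CLS] vectors are contextualized by a small Transformer before pooling and classification. Hyperparameters and pooling choices are illustrative, not those of [Chalkidis et al., 2021c].

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class HierarchicalEncoder(nn.Module):
    """Sketch of a hierarchical document classifier: segment-level BERT encoding
    followed by a Transformer over the segment [CLS] embeddings."""
    def __init__(self, encoder_name="bert-base-uncased", num_labels=10):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.seg_transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids / attention_mask: (batch, num_segments, seq_len)
        b, s, l = input_ids.shape
        out = self.encoder(input_ids.view(b * s, l),
                           attention_mask=attention_mask.view(b * s, l))
        cls = out.last_hidden_state[:, 0, :].view(b, s, -1)  # segment [CLS] embeddings
        ctx = self.seg_transformer(cls)                      # contextualize across segments
        doc = ctx.max(dim=1).values                          # pool segments into a document vector
        return self.classifier(doc)                          # (batch, num_labels)
```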

FairLex [Chalkidis et al., 2022c] comprises four datasets (ECtHR, SCOTUS, Swiss-Judgment-Prediction and CAIL [Wang et al., 2021]) for the evaluation of model fairness.89 To this end, it includes three groups of fairness attributes: demographic, regional and legal area. The first group regards biases relating to factors such as gender, age and race. The second and third groups aim to alleviate disparity, respectively, across regions of a given jurisdiction and across different areas of law. Moreover, it covers five languages (Chinese, English, French, German and Italian) and four jurisdictions (China, USA, Switzerland and the European Council). The authors also provide four hierarchical BERT-based models, one for each dataset, as baselines for the benchmark. Such models are similar to [Chalkidis et al., 2021c] and are further pre-trained on the specific dataset. The models are warm-started from MiniLMv2 checkpoints, using a distilled version of RoBERTa for the English version and a distilled version of XLM-R for the other languages. Experimental results show that the models exhibit some disparity in performance. In particular, in the ECtHR task there is a disparity related to the defendant state and the applicant’s gender, while for the FSCS task there is disparity related to language (Italian versus French and German), legal areas (penal law versus the others) and court regions (Switzerland courts versus federation courts). Disparity across court regions is also noted in the CAIL task (Beijing courts versus Sichuan courts). However, disparities in performance can be influenced by general factors tied to the distribution of data. [Bhattacharya et al., 2020a] collect documents of the Supreme Court of India and statutes of the Indian judiciary through the Thomson Reuters Westlaw India website,90 for a document similarity task. In particular, they propose and evaluate an approach based on a precedent citation network augmented with the hierarchy of legal statutes, in order to also encompass knowledge of the legal texts’ hierarchy.
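
With reference to the FairLex evaluation above, per-group performance disparity can be quantified, for instance, as the gap between the best- and worst-performing group's macro-F1; the snippet below is a generic sketch using scikit-learn, not the benchmark's official metric.

```python
from sklearn.metrics import f1_score  # assumption: scikit-learn for the metric

def group_disparity(y_true, y_pred, groups):
    """Per-group macro-F1 and the gap between the best- and worst-performing
    group, as a simple proxy for the performance disparities discussed above."""
    scores = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        scores[g] = f1_score([y_true[i] for i in idx],
                             [y_pred[i] for i in idx], average="macro")
    return scores, max(scores.values()) - min(scores.values())

# Toy usage: labels predicted for cases coming from two court regions.
scores, gap = group_disparity([0, 1, 1, 0], [0, 1, 0, 0], ["beijing", "beijing", "sichuan", "sichuan"])
print(scores, gap)
```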

Pile of Law [Henderson et al., 2022] is a legal corpus designed to address ethical issues in the pre-training phase.91 It is collected from 35 EU and US data sources and covers several legal sub-domains, such as court opinions, administrative rules, contracts and legislative records, for a total of about 10M documents. Such data already has implicit filters which reflect the legal standards of the specific jurisdiction, but the authors note that not all of them have been detected and that such norms can vary across jurisdictions. By applying filtering rules to the data, the proposed dataset respects the legal norms of governments and courts regarding the presence of toxic (offensive or obscene terms) or private content, and prevents a model from learning such information. The authors demonstrate that the dataset can be used to learn contextual privacy/toxicity rules, as it respects the variation in the different privacy/toxicity norms. For example, they demonstrate that models pre-trained on Pile of Law can learn contextual privacy rules with regard to the use of pseudonyms in immigration court and in civil litigation. In particular, a BERT-based model is trained on the data to predict whether a pseudonym should be used. Moreover, [Henderson et al., 2022] provide a baseline by pre-training BERT on the corpus from scratch. The resulting model, called PoL-BERT-Large, is fine-tuned and evaluated for a legal reasoning task on CaseHOLD, reaching about the same performance reported in LexGLUE and outperforming BERT. However, it does not outperform a BERT model trained exclusively on case law data. This is probably due to the extreme data diversity in the corpus, which limits the pre-training efficacy with respect to competitors trained exclusively on task-oriented data such as CaseHOLD.
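
A minimal sketch of pre-training a BERT model from scratch on a legal corpus via masked language modeling is given below, assuming Hugging Face Transformers and Datasets; the file path, tokenizer and hyperparameters are hypothetical and do not reproduce the PoL-BERT-Large recipe.

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Hypothetical local sample of the pre-training corpus, one document per line.
corpus = load_dataset("text", data_files={"train": "pile_of_law_sample.txt"})

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # or a custom legal-domain tokenizer
config = BertConfig(vocab_size=tokenizer.vocab_size)                # base-sized config for brevity
model = BertForMaskedLM(config)                                     # randomly initialized, i.e., "from scratch"

tokenized = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                       batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pol-bert-sketch", per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()  # masked language modeling over the legal corpus
```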

...

References

Andrea Tagarelli, Candida M. Greco (2023). “Bringing Order Into the Realm of Transformer-based Language Models for Artificial Intelligence and Law.” doi:10.48550/arXiv.2308.05502.