2022 On the Effectiveness of Pre-trained Language Models for Legal Natural Language Processing: An Empirical Study

From GM-RKB

Subject Headings: LegalBERT, Legal NLP.

Notes

Cited By

Quotes

Abstract

We present the first comprehensive empirical evaluation of pre-trained language models (PLMs) for legal natural language processing (NLP) in order to examine their effectiveness in this domain. Our study covers eight representative and challenging legal datasets, ranging from 900 to 57K samples, across five NLP tasks: binary classification, multi-label classification, multiple choice question answering, summarization and information retrieval. We first run unsupervised, classical machine learning and/or non-PLM based deep learning methods on these datasets, and show that baseline systems’ performance can be 4%~35% lower than that of PLM-based methods. Next, we compare general-domain PLMs and those specifically pre-trained for the legal domain, and find that domain-specific PLMs demonstrate 1%~5% higher performance than general-domain models, but only when the datasets are extremely close to the pre-training corpora. Finally, we evaluate six general-domain state-of-the-art systems, and show that they have limited generalizability to legal data, with performance gains from 0.1% to 1.2% over other PLM-based methods. Our experiments suggest that both general-domain and domain-specific PLM-based methods generally achieve better results than simpler methods on most tasks, with the exception of the retrieval task, where the best-performing baseline outperformed all PLM-based methods by at least 5%. Our findings can help legal NLP practitioners choose the appropriate methods for different tasks, and also shed light on potential future directions for legal NLP research.


...

...

SUMMARIZATION

For this task, we selected the JRC-Acquis [18] and the BillSum [19] datasets for our experiments. We did not find a download link for the dataset by Bhattacharya et al. [101] and the LegalOps [20] dataset was not available while we were conducting our summarization experiments. The AILA2021 dataset and the one by Manor and Li [108] are of substantially smaller scale than the two selected ones, and are thus less ideal for our study.

The JRC-Acquis dataset consists of about 23K samples and the goal is to produce summaries for EU legislation, international agreements, treaties, and other document types. BillSum also contains about 23K samples, and the task is to summarize U.S. Congressional and state bills. One major difference between the two datasets is their input and output lengths. In terms of input documents, JRC-Acquis and BillSum have an average of 2,000 and 1,300 words respectively; for the targets, BillSum’s summaries are 182 words long on average, substantially longer than JRC-Acquis’ average length of 30 words. The long inputs from both datasets pose significant challenges for several mainstream PLMs (e.g., BERT and RoBERTa, which have a hard limit of 512 tokens); more importantly, the much longer target texts from BillSum make the summarization process both more technically challenging and time-consuming.
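To make the 512-token constraint concrete, below is a minimal sketch (not taken from the paper) of how a long legal document overruns an encoder such as BERT and is typically truncated before being fed to the model; it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, and the sample text is a stand-in for a real bill.

<pre>
# Minimal sketch (assumption: Hugging Face transformers is installed) showing how a
# long legal document exceeds BERT's 512-token limit and must be truncated.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stand-in for a ~2,000-word bill or piece of legislation.
document = " ".join(["The Parties agree that the provisions herein shall apply."] * 300)

# Without truncation: count how many sub-word tokens the document produces.
full_ids = tokenizer(document, truncation=False)["input_ids"]
print(f"full document: {len(full_ids)} tokens")  # typically far above 512

# With truncation: everything past the model's 512-token limit is discarded,
# so any content in the tail of the document never reaches the encoder.
encoded = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
print(f"after truncation: {encoded['input_ids'].shape[1]} tokens")
</pre>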

...

FUTURE WORK

In all of our experiments, we either feed entire documents to PLMs whenever possible or truncate them to some maximum length. One potential future direction lies in being more selective about the texts we send to PLMs, i.e., Text Selection. Take the multi-label POSTURE50K and the case retrieval COLIEE-2021 datasets as examples. Cases from these datasets can have thousands of words, but may consist of multiple ‘‘sections’’ (either explicitly or implicitly) that each address a different aspect of the case, such as background, analysis and conclusion. For retrieving relevant cases, the ‘‘summary’’ paragraphs could be more helpful; however, for the task of extracting legal procedural postures, sections that provide detailed analyses of the major legal issues in the case would be needed, since legal postures require substantive discussion. Therefore, selecting the potentially most helpful paragraphs or sections for a given task may lead to better performance. At the same time, this reduces the need to ingest longer inputs, thus lowering computational costs.
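As an illustration of what such text selection might look like, the following sketch (our illustration, not the paper's method) ranks the sections of a case by TF-IDF similarity to a task-specific query and keeps only the highest-scoring ones within a word budget; the query string, budget, and sample sections are hypothetical, and scikit-learn is assumed to be available.

<pre>
# Minimal sketch of task-driven text selection (an assumption, not the paper's method):
# rank the sections of a long case by TF-IDF similarity to a task description and
# keep the top-scoring ones until a word budget is reached.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def select_sections(sections, task_query, word_budget=500):
    """Return the sections most similar to task_query, within a total word budget."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(sections + [task_query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

    selected, used = [], 0
    for idx in scores.argsort()[::-1]:          # most similar sections first
        n_words = len(sections[idx].split())
        if used + n_words > word_budget:
            continue
        selected.append(sections[idx])
        used += n_words
    return selected


# Hypothetical usage for the posture-extraction setting described above.
sections = [
    "BACKGROUND: The plaintiff filed suit in state court alleging breach of contract.",
    "ANALYSIS: The court considers the defendant's motion for summary judgment in detail.",
    "CONCLUSION: For the foregoing reasons, the motion is granted.",
]
print(select_sections(sections, "procedural posture motion for summary judgment"))
</pre>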

There have been several studies on fairness, bias and stereotypes in word embeddings both in the general domain [122]–[124] and in the legal domain [125]. Similarly, several recent works have also exposed unintended social biases in PLMs [126]–[129]. For example, in the context of legal judgment prediction, Wang et al. [125] showed the existence of regional sentencing differences on the same types of crimes. Although not the focus of this paper, considering the high stakes involved in machine-assisted decision-making in the legal domain and the rising adoption of PLMs, we believe future research around these topics deserves more attention from (legal) NLP researchers, domain experts and social scientists.

Another area of study might be around data annotation. As mentioned earlier, obtaining high-quality and large-scale annotated legal data is expensive in several aspects. Therefore, it would be interesting to examine how different techniques (such as active learning and data augmentation) could facilitate the annotation process, and explore their impact on final model performance. Finally, given the existence of datasets in different languages (Table 1), multilingual PLMs could be another interesting research topic, especially on the potential transferability across different languages.
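As one concrete example of how active learning could lower annotation cost, the sketch below (an illustration, not part of the paper's experiments) selects the unlabeled documents on which a classifier is least confident, so that human annotation effort is directed where the model is most uncertain; the probability matrix stands in for the softmax output of any classifier, such as a PLM fine-tuned on a small labeled seed set.

<pre>
# Minimal sketch of uncertainty-based active learning for annotation:
# pick the unlabeled documents whose predicted class probabilities are
# least confident and send them to human annotators first.
import numpy as np


def least_confident(probabilities: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k samples with the lowest top-class probability.

    probabilities: array of shape (n_samples, n_classes), e.g. softmax outputs
    from a classifier trained on the currently labeled data.
    """
    confidence = probabilities.max(axis=1)   # probability of the predicted class
    return np.argsort(confidence)[:k]        # least confident samples first


# Hypothetical usage: 5 unlabeled documents, 3 candidate labels.
probs = np.array([
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],   # very uncertain -> should be annotated early
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # very uncertain -> should be annotated early
    [0.88, 0.07, 0.05],
])
print(least_confident(probs, k=2))   # -> [3 1]
</pre>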

...
...
TABLE 10. Optimal parameter settings for baseline systems. MLP denotes multilayer perceptron; vocab and lr denote vocabulary and learning rate, respectively. For BiLSTM, the ‘‘max input’’ refers to the number of whitespace-separated words.
...
...
TABLE 11. Optimal parameter settings for general-domain and domain-specific PLMs. lr denotes learning rate. ‘‘max input’’ and ‘‘max output’’ (whenever applicable) are with respect to the number of tokens decided by the corresponding PLM tokenizers.

CONCLUSION

In this paper, we presented an extensive and up-to-date empirical study on the effectiveness of pre-trained language models (PLMs) for legal NLP. Rather than focusing on one or two specific types of NLP tasks, our study covers eight legal datasets that belong to five different types of tasks, including Binary Classification, Multi-label Classification, Summarization, Multiple Choice Question Answering and Information Retrieval, all of which have critical applications in the legal domain. We ran simple baselines in order to set up the basis for further comparison. Then, we experimented with both mainstream general-domain PLMs and two recent legal-specific PLMs on these tasks, and compared their performance. Importantly, we also selected and applied six recent general-domain state-of-the-art systems to the legal datasets in order to explore their out-of-domain generalizability. Finally, we summarized and discussed our findings, and presented several potentially promising future directions for legal NLP research, such as in-domain transferability, text selection and fairness/bias/stereotype detection.

In future work, rather than truncating long inputs, one promising direction is to perform text selection, which could potentially lead to better performance at reduced cost. Also, data augmentation techniques could be considered, e.g., for the less frequent labels in multi-label classification tasks, among others. Finally, fairness and bias in word embeddings and language models deserve continued research efforts.

For industrial practitioners, we hope that our study can serve as a practical guide for selecting the appropriate methods for different tasks, considering the trade-offs between performance gains and computational costs. At the same time, we hope that the extensive experiments on different tasks, especially the comparisons between different types of approaches, can shed light on the most promising and pressing research directions for legal natural language processing.

... ...

TABLE 12. Optimal parameter settings for state-of-the-art approaches. lr denotes learning rate. The ‘‘max input’’ and ‘‘max output’’ (whenever applicable) are with respect to the number of tokens decided by the corresponding PLM tokenizers. We mainly tuned the parameters displayed in this table and followed the original papers for the other settings. On each of the two multi-label datasets, LightXML obtained its best performance on the same setting when using different PLM kernels. On CaseHOLD, MMM achieved the best performance on the same setting with different PLM kernels.


References

(Song et al., 2022) ⇒ Dezhao Song, Sally Gao, Baosheng He, and Frank Schilder. (2022). “On the Effectiveness of Pre-trained Language Models for Legal Natural Language Processing: An Empirical Study.” In: IEEE Access. doi:10.1109/ACCESS.2022.3190408