2023 DALE: Generative Data Augmentation for Low-Resource Legal NLP

From GM-RKB

Subject Headings: Low-Resource Legal NLP, Generative Data Augmentation Framework.

Notes

Cited By

Quotes

Abstract

We present DALE, a novel and effective generative Data Augmentation framework for low-resource LEgal NLP. DALE addresses the challenges existing frameworks pose in generating effective data augmentations of legal documents - legal language, with its specialized vocabulary and complex semantics, morphology, and syntax, does not benefit from data augmentations that merely rephrase the source sentence. To address this, DALE, built on an Encoder-Decoder Language Model, is pre-trained on a novel unsupervised text denoising objective based on selective masking - our masking strategy exploits the domain-specific language characteristics of templatized legal documents to mask collocated spans of text. Denoising these spans helps DALE acquire knowledge about legal concepts, principles, and language usage. Consequently, it develops the ability to generate coherent and diverse augmentations with novel contexts. Finally, DALE performs conditional generation to generate synthetic augmentations for low-resource Legal NLP tasks. We demonstrate the effectiveness of DALE on 13 datasets spanning 6 tasks and 4 low-resource settings. DALE outperforms all our baselines, including LLMs, qualitatively and quantitatively, with improvements of 1%-50%.

1 Introduction

With recent advances in deep learning for NLP, many systems have achieved state-of-the-art and near-human performance on benchmark Natural Language Understanding (NLU) datasets (Wang et al., 2018, 2019). Following this closely, the legal NLP literature has also been thriving with new datasets and frameworks (Chalkidis et al., 2021c; Niklaus et al., 2023; Chalkidis* et al., 2023). However, one common observation is that most techniques, built and evaluated on NLP tasks involving everyday natural language, do not easily transfer to the legal domain (Zhong et al., 2020a; Chalkidis et al., 2020; Katz et al., 2023). Legal language, also known as legalese and commonly classified as a "sublanguage" (Sinsheimer, 2007; Williams, 2007; Haigh, 2023), is governed by logical rules and is distinct from everyday natural language in terms of specialized vocabulary, morphology, complex syntax, and knowledge-specific semantics, which makes the transfer difficult. Interestingly, modern Large Language Models (LLMs), both open- and closed-source (like ChatGPT), that have been shown to possess excellent reasoning abilities and achieve impressive performance on zero-shot NLU tasks (HuggingFace, 2023), often do not perform well on Legal Language Understanding (LLU) tasks (Chalkidis, 2023). With state-of-the-art instruction-tuned LLMs as our baselines, we also show that LLMs struggle to generate effective augmentations for LLU tasks and fail to preserve label consistency when the source legal document is long.

Code: https://github.com/Sreyan88/DALE

Original 1: Buyer has full power and authority to enter into this Agreement.
Original 2: The Borrower is organized, validly existing and in good standing under the laws of the jurisdiction of its organization.

EDA (Wei and Zou):
1: buyer has wide cut power and authority to enter into this agreement
2: the borrower is organized validly existing and in good standing under the laws the jurisdiction its organization

Legal-EDA (Perçin et al.):
1: Purchaser has full-of-the-moon major power and self-assurance to enter into this agreement.
2: The borrower is organized, validly existing and in just stand up under the law of the legal power of its organization.

SSMBA (Ng et al.):
1: buyer is full custody and agrees to enter into this agreement.
2: the borrower is organized, validly existing and in good peace under the laws in the jurisdiction or and organization

GENIUS (Guo et al.):
1: Who has the authority to do this?
2: The Borrower is organized into three categories: validly existing, validly new, and validly old. The first category is new. The second category is old.

ChatGPT:
1: The buyer possesses complete authority to engage in this agreement.
2: The Borrower is legally established, currently active, and in compliance with the laws of the jurisdiction where it is organized.

DALE (ours):
1: The Company has full power and authority to enter into this Agreement and to perform its obligations hereunder.
2: The Company is a corporation duly organized, validly existing and in good standing under the laws of the State of Delaware.

Table 1: Comparison of augmentations generated using DALE and our baselines. DALE generates coherent and diverse augmentations in addition to introducing new context while preserving label consistency (1.Payments 2.Authority).

Improving the performance of deep learning models on downstream LLU tasks requires sufficient good-quality training data. Beyond being an expensive and noisy task (Abad and Moschitti, 2016; Nguyen et al., 2017), high-quality annotation in specialized domains like legal or biomedical is prohibitively expensive due to the need for expert annotators with requisite domain knowledge that lay annotators may not possess. One common approach taken by researchers for NLU tasks is data augmentation, either online (Guo et al., 2019; Ng et al., 2020a; Sun et al., 2020; Guo, 2020; Sawhney et al., 2021) or offline in the form of generated synthetic data (Wei and Zou, 2019; Kumar et al., 2020; Zhou et al., 2021; Kim et al., 2022; Guo et al., 2022a). Though most offline techniques perform well when employed for low-resource NLU tasks, we show that they tend to struggle on almost all LLU tasks, often generating incoherent and non-diverse augmentations, eventually leading to sub-optimal performance. We attribute this to the algorithmic biases of existing augmentation approaches towards natural language and the differing characteristics of legal language (see Section 2 for more details). For example, most of these techniques tend to merely rephrase the source document, which is ineffective for LLU tasks due to the formalized nature of legal language and adversely affects both generation diversity and downstream model generalization. Longpre et al. (2020) also emphasize that task-agnostic augmentation frameworks lead to reduced performance. To overcome these issues, researchers in specialized domains (e.g., biomedical) have developed specialized algorithms (Kang et al., 2020; Ghosh et al., 2023), but to the best of our knowledge, no such approach has been proposed for the legal domain.

Main Contributions. In this paper, we present DALE, a novel data augmentation technique based on conditional generation for low-resource legal NLP. Based on our initial analysis of legal documents, we propose that augmentations enhancing LLU task performance can be achieved not just by rephrasing documents but also by modifying existing contexts or introducing novel ones. DALE, designed to perform this, builds on BART (Lewis et al., 2019) and is first pre-trained on a large-scale unlabeled legal corpus using a novel text denoising objective based on selective masking. Specifically, we leverage the inherent properties of templatized legal language to mask co-occurring and highly correlated spans of text in a legal document while avoiding masking random and emerging entities or facts. Our masking algorithm preserves valuable hints and prevents the model from learning redundant knowledge by not asking it to reconstruct document-specific entities or facts. Rather, it promotes acquiring broad legal knowledge and knowledge of legalese, which enables DALE to generate augmentations of legal documents with novel contexts that possess remarkable levels of coherence and diversity. We call this masked document a template, and it serves as input to DALE for denoising-based pre-training. We optionally fine-tune DALE on the downstream dataset, followed by conditional generation to generate augmentations. We show that our domain-specific sentence corruption algorithm enables DALE to generate diverse and coherent augmentations of legal documents, which are entity-rich, semantically complex, and formal in nature. To summarize, our primary contributions are:

1. We propose DALE, the first generative data augmentation framework designed for low-resource legal NLP.

2. Through extensive empirical evaluation on 6 LLU tasks, 13 datasets, and 4 low-resource settings, we show that DALE outperforms all prior works with significant gains of 1%-50%.

3. Additionally, through extensive ablative experiments and qualitative comparison, we show that DALE generates much more diverse and coherent augmentations than prior works.

2 Related Work

Legal NLP. Recently, the legal NLP literature has been flourishing with new resources like datasets (Leitner et al., 2019; Zhong et al., 2020b; Zheng et al., 2021; Hendrycks et al., 2021), benchmarks (Chalkidis et al., 2021c; Niklaus et al., 2023; Chalkidis* et al., 2023) and PLMs (Chalkidis et al., 2020; Xiao et al., 2021; Mamakas et al., 2022; Niklaus and Giofré, 2022). However, despite much progress, the specialized domain of legal language lags behind in available resources when compared to natural language or domains like bio-medical (Katz et al., 2023). As also mentioned earlier, most techniques employed for building better deep learning NLU models do not transfer well to the legal domain due to characteristics that make it distinct from natural language (Morrison, 1989; Nair and Modani, 2023; Glogar, 2023), including its highly formal, technical, entity-rich and knowledge-rich nature, along with semantically complex phrases. Simply put, the task of training machines to “understand” legal language has proven to be nontrivial (Katz et al., 2023). For quite some time, researchers tried to teach models to solve complex LLU problems through prior findings in NLU, e.g., pre-training LMs (Chalkidis et al., 2020). However, this has come with varying success (Zheng et al., 2021). Exploiting domain-specific characteristics to build custom pre-training strategies has shown better success (Nair and Modani, 2023; Chalkidis* et al., 2023), and we emphasize that there is a similar need for all tasks in legal NLP.

Orig: Did the superior court abuse its discretion in dismissing Morgans appeal for failure to exhaust administrative remedies?

RM:  <mask> abuse <mask> discretion <mask> Morgans appeal <mask> to exhaust administrative <mask>
GM:  <mask> abuse its discretion <mask> dismissing Morgans appeal <mask> to exhaust administrative <mask>
PMI: Did the <mask> abuse its discretion in dismissing <mask> appeal for failure to exhaust <mask> ?
DM:  <mask> in dismissing Morgans <mask> to exhaust administrative <mask> ?   (preserves hints ✓, avoids randomness ✓)

Other sentences with the same co-occurring span, masked by DM:
<mask> in failing to allow Hertz to intervene as a pro se plaintiff ?   (✓ ✓)
<mask> in awarding attorneys fees to moore in the <mask> 12,560.37 ?   (✓ ✓)

Figure 1: Comparison of various span masking algorithms in legal documents rich in emerging entities and case-specific facts. RM stands for random masking, GM stands for GENIUS extreme masking (Guo et al., 2022a), PMI stands for PMI masking (Levine et al., 2021) and DM stands for our proposed DALE masking. Unlike other masking algorithms that make a model learn redundant knowledge through denoising entities or random tokens, our proposed masking formulation promotes learning of broader legal knowledge and legalese by masking co-occurring spans that consistently provide high signals.

Data Augmentation for Low-Resource NLP. Data augmentation, both online (Guo et al., 2019; Ng et al., 2020a; Sun et al., 2020; Kumar et al., 2020; Guo, 2020; Sawhney et al., 2021) and offline (Wei and Zou, 2019; Kumar et al., 2020; Zhou et al., 2021; Kim et al., 2022; Guo et al., 2022a), has seen great success in overcoming the data scarcity issue in low-resource NLU tasks. While the former employs techniques like latent space interpolation or mixing, the latter is based on generating synthetic data that can be combined with the original data to aid low-resource or few-shot learning (Chen et al., 2023). However, though the data scarcity issue is exacerbated in specialized domains like legal, where annotation becomes prohibitively expensive (Yang et al., 2019), domain-specific data augmentation techniques in the literature are thin and almost non-existent, especially for the legal domain. Perçin et al. (2022) propose the only legal domain-specific approach for data augmentation. However, they substitute phrases from WordNet (Miller, 1995) and thus fail to generate diverse augmentations for legal text, as they only edit common natural-language phrases found in WordNet. Similarly, the performance of back-translation (Yu et al., 2018) is limited by the inability of machine-translation systems to translate entity-rich and formal legal language effectively. The work closest to ours is Guo et al. (2022a) and Wang et al. (2022), where a PLM is trained on a keyword-to-sentence reconstruction task. However, these systems rely on unsupervised keyword discovery, which is naturally biased towards the rare entities prevalent in legal documents. Such entities are case- or document-specific, and denoising them would lead a model to learn redundant knowledge by reconstructing case-specific facts of which it has no prior knowledge. Without informed masking, a similar conclusion can be drawn for other PLM-based approaches in the literature (Kumar et al., 2020; Guo et al., 2022a).

3 Methodology

3.1 DALE Pre-training

Primary Goal. Our primary goal is to devise a denoising-based seq-to-seq pre-training algorithm crafted to favor our final objective, i.e., generating diverse and coherent data augmentations. Sentence denoising is better suited to our task (compared to other methods like prompt- or instruction-tuning) as it gives us better control over long-document generations (explained further in Appendix E). The type of knowledge acquired through denoising objectives has been seen to be highly dependent on the masking algorithm (Sadeq et al., 2022). Thus, to achieve our objective and devise a suitable masking algorithm, we first try to answer a question crucial to the success of our approach: which attributes should an augmentation of a legal document possess to be considered effective, enabling improved generalization in downstream LLU tasks?

[Figure 2 appears here. Panels: (1) Correlated Span Extraction over the pre-training corpus using discounted Pointwise Mutual Information (PMI); (2) Optimal Context Selection via PageRank, dropping irrelevant sentences from a long document; (3) Span Ranking and Selective Masking, retaining top-scored spans and masking the rest with added randomness; followed by denoising pre-training, optional fine-tuning on a low-resource downstream LLU dataset, and augmentation generation.]

Figure 2: Illustration of DALE. (1) We extract all correlated spans from a legal corpus using our discounted PMI formulation. (2) We shorten a legal document by selecting only the top-k sentences that are the most relevant to the document and removing the rest. (3) We rank all the spans based on their importance and length using our novel scoring metric. Finally, we create a template by retaining the top-p spans and masking all other spans with added randomness. This process is followed by optional fine-tuning on the downstream dataset and conditional generation of augmentations from corrupted legal documents.

After conducting an analysis of legal documents, we hypothesize that the formal language used in the legal domain rarely allows for the occurrence of a rephrased version of the original document, unlike in everyday natural language. In fact, effective augmentations need to add new context to legal documents or modify existing ones.

What to mask? To modify existing context or introduce a novel one in legal documents, while maintaining the formal legal style and the plausibility of events in the generated context, DALE, like a legal practitioner, should possess both broad legal knowledge and knowledge of legalese. However, acquiring either from legal documents with complex semantics and syntax is not trivial. Legal documents, written by law practitioners, consist of clauses that are primarily document- (or case-) specific facts. The text is entity-rich, and the entities are usually emerging, as they are unique to that document. Beyond entities, these documents also contain text fragments outlining these entities, which can be seen as an outcome of the broad legal knowledge possessed by the practitioner. These co-occurring fragments, generally genre- or corpus-specific, are commonly reused by practitioners across documents. Their presence is a core property of legalese and can be attributed to its trait of being a formalized language (Nair and Modani, 2023). Fig. 1 shows an example sentence from a document with such a structure (more examples in Table 17). Thus, we hypothesize that learning to denoise these fragments, given appropriate context and hints, will eventually lead DALE to acquire knowledge about legal concepts, principles, and language usage by consistently providing high signals and avoiding noise. This, in turn, allows DALE to generate consistent, plausible, and diverse augmentations. Fig. 1 pictorially describes the problem with current masking algorithms and how our proposed algorithm favors our task.

We call our final masked or corrupted document a template and denote it as T. DALE pre-training involves multiple steps for template creation, followed by training to denoise these templates. We next describe each step to create T, which is done corpus-wise due to the variability of legalese across domains and genres.

(1) Correlated Span Extraction. To extract these reusable text fragments from an unlabeled legal corpus without supervision, we identify them as correlated spans of tokens. First, we denote the set of all n-gram spans in a corpus C as N_C = {n_0, ..., n_K}, where every span n_k = {w_1, ..., w_n} and n ranges from 2 to q. Our objective is to extract a set of distinct spans S_C = {sp_0, ..., sp_T} from N_C, where each span sp_t exhibits high co-occurrence over the corpus. Though modeling such correlations is widely studied in computational linguistics (Zuidema, 2006; Ramisch et al., 2012), we choose Pointwise Mutual Information (PMI) (Fano, 1961) as the metric to score all individual n-grams in a corpus. PMI, by definition, quantifies how often two tokens occur together, compared with what we would expect if they were independent. Our proposed strategy is based on the PMI formulation of Levine et al. (2021), which extends PMI to n-grams as follows:

\mathrm{PMI}(1,n) = \min_{\sigma \in \mathrm{seg}(w_1 \ldots w_n)} \log \frac{p(w_1 \ldots w_n)}{\prod_{s \in \sigma} p(s)}    (1)

where PMI(1,n) is the PMI of the n-gram {w_1, ..., w_n} and seg(w_1, ..., w_n) is the set of all contiguous segmentations of the n-gram. We refer our readers to the original paper for more algorithmic details.
However, this base formulation faces two main challenges when extended to legal documents: (a) the PMI formulation is designed to favor tokens with a lower frequency, making it choose rare tokens rather than the text fragments of interest. This is further exacerbated by the fact that text in the legal domain is rich in case-specific, rare, and emerging entities. (b) There is no clear way to retain hints for reconstruction in the original formulation. Since legal language is highly domain-specific, not doing so might lead a model to hallucinate or training to collapse (Li et al., 2021; Sadeq et al., 2022). We describe how we overcome (b) in step (3). To overcome (a), we propose modifying the existing formulation by imposing a discounting factor that penalizes rare tokens (Pantel and Lin, 2002). Thus, our modified formulation is as follows:

\mathrm{PMI}(1,n) \times \frac{\log f(w_1 \ldots w_n)}{\log(c) + \log f(w_1 \ldots w_n)}    (2)

where f(·) is the frequency of occurrence of the n-gram, and c is the constant factor used as a threshold to remove rare tokens. Precisely, c refers to the minimum frequency of occurrence of an n-gram in the corpus, below which the n-gram is penalized. c is calculated based on the density of rare tokens in the corpus and is usually set to the p_c-th percentile of the frequency distribution of all n-grams in the corpus. We choose c specific to the value of n of the n-gram in the specific corpus. Generally, PMI for datasets with a higher degree of rare entities per document is discounted with a c corresponding to a frequency at a higher p_c (like Caselaw (cas, 2018) and Edgar (Henderson et al., 2022)). In contrast, datasets with a lower degree of entities or a lower overall degree of formal language are discounted with a c corresponding to a frequency at a lower p_c (like r/legaladvice (Henderson et al., 2022)). Finally, we select the top j% of n-grams with the lowest PMI score to construct S_C. We provide more details in Appendix B.1, including examples that show the effect of c on correlated span extraction.
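To make Eqs. 1 and 2 concrete, below is a minimal, self-contained sketch of the discounted PMI scorer. It is illustrative only: the shared count table, the probability estimates, and the toy corpus are simplifying assumptions, not the authors' released implementation (see the repository linked above for that).

```python
import math
from collections import Counter
from itertools import chain

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def segmentations(gram):
    """All contiguous segmentations of an n-gram into two or more parts."""
    if len(gram) == 1:
        yield [gram]
        return
    for i in range(1, len(gram)):
        for rest in segmentations(gram[i:]):
            yield [gram[:i]] + rest

def pmi(gram, counts, total):
    """Eq. 1: minimum PMI over all contiguous segmentations of the n-gram."""
    p = lambda g: counts[g] / total
    if counts[gram] == 0:
        return float("-inf")
    return min(
        math.log(p(gram) / math.prod(p(s) for s in seg))
        for seg in segmentations(gram)
        if all(counts[s] > 0 for s in seg)
    )

def discounted_pmi(gram, counts, total, c):
    """Eq. 2: discount rare n-grams so case-specific entities do not dominate."""
    f = counts[gram]
    return pmi(gram, counts, total) * (math.log(f) / (math.log(c) + math.log(f)))

# Toy usage: count 1- to 3-grams over a (tiny, repeated) corpus, then score a span.
tokens = (" ".join(["in good standing under the laws of the jurisdiction"] * 3)).split()
counts = Counter(chain.from_iterable(ngrams(tokens, n) for n in (1, 2, 3)))
total = sum(counts.values())
print(discounted_pmi(("good", "standing"), counts, total, c=2))
```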
(2) Optimal Context Selection. Legal corpora, labeled and unlabeled, are generally structured at the granularity of the document level (a collection of sentences). However, they are generally long (see Appendix H for dataset details), and denoising-based pre-training with an enc-dec model allows us to scale only to the maximum output sequence length l_y of the decoder (irrespective of the encoder input sequence length). As mentioned earlier, DALE employs BART-large with a maximum output sequence length of 1024 tokens (Appendix E explains the rationale behind our choice). A common choice in such a scenario would be to simply select the first l_y tokens of the document D_raw to form a shorter document D_p. However, this creates a text-informativeness mismatch between pre-training and fine-tuning instances, as raw legal documents have sparse information compared to fine-tuning instances (Sugathadasa et al., 2019). Thus, we choose to perform optimal context selection, i.e., select the sentences of the document with a high informativeness measure. To this end, we propose to use the PageRank algorithm (Page et al., 1999), boosted by sentence similarity. Given a document D_raw with sentences [s_0, ..., s_n], we use an encoder E_pre to calculate the embedding of each sentence, [e_{s_0}, ..., e_{s_n}], and of the entire document, e_{D_raw}. This is followed by calculating the cosine similarity between every two sentences, indexed i and j, as follows:

s_{i,j} = \frac{e_{s_i} \cdot e_{s_f}}{\lVert e_{s_i} \rVert \, \lVert e_{s_f} \rVert}    (3)

where i, j ∈ {1, ..., n} and e_{s_f} = \lambda e_{s_j} + (1 - \lambda) e_{D_raw}. After this step, we construct an n × n similarity matrix, which serves as the adjacency matrix of a graph G = (V, E), where the sentences form the vertices V and the similarity scores form the edges E. Finally, we apply PageRank(G) to assign every sentence an importance score and select the top-k sentences, not exceeding 1024 tokens. Following this, we sort the selected sentences in the document's original order of occurrence. We sample a probability ε from a Gaussian distribution N(μ, σ²) and only perform this step if ε crosses a set threshold β.
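A minimal sketch of this selection step, combining Eq. 3 with PageRank via networkx. Here `embed` stands in for the pre-trained encoder E_pre (legal-longformer-large, per Appendix B) and is an assumed black box returning one vector per text; the token budget and the non-negativity clipping are also simplifying assumptions.

```python
import numpy as np
import networkx as nx

def select_context(sentences, embed, lam=0.7, max_tokens=1024):
    """Keep the most informative sentences of a long document (Eq. 3 + PageRank)."""
    e_sents = [embed(s) for s in sentences]
    e_doc = embed(" ".join(sentences))
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Document-boosted target embedding e_{s_f} = lam*e_{s_j} + (1-lam)*e_doc
            e_f = lam * e_sents[j] + (1 - lam) * e_doc
            cos = e_sents[i] @ e_f / (np.linalg.norm(e_sents[i]) * np.linalg.norm(e_f))
            sim[i, j] = max(cos, 0.0)  # clip: PageRank expects non-negative weights
    graph = nx.from_numpy_array(sim)   # similarity matrix as weighted adjacency
    scores = nx.pagerank(graph)        # importance score per sentence index
    # Keep the highest-ranked sentences within the token budget...
    kept, budget = set(), max_tokens
    for i in sorted(range(n), key=scores.get, reverse=True):
        cost = len(sentences[i].split())   # crude whitespace token count
        if cost <= budget:
            kept.add(i)
            budget -= cost
    # ...then restore the document's original sentence order.
    return [s for i, s in enumerate(sentences) if i in kept]
```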

(3) Selective Masking. Having obtained the set of correlated spans S_C from step (1) and D_p from step (2), we now want to select the best candidates for masking from all spans in S_{D_p}, the spans of S_C that are present in document D_p. To this end, we devise a novel span-ranking metric to construct our template such that we preserve valuable hints while preferring longer spans for masking. Formally, we first use a pre-trained encoder E_pre to calculate the embedding of each span, [e_{sp_0}, ..., e_{sp_T}], and of the entire document, e_{D_p}, followed by assigning an importance score i_t to each span sp_t as follows:

i_t = \frac{\mathrm{sim}(e_{sp_t}, e_{D_p})}{\mathrm{norm}(\mathrm{len}(sp_t))}    (4)

where sim(·) is the cosine similarity between each e_{sp_t} and e_{D_p}, calculated similarly to Eqn. 3. The denominator is the length of the span, normalized across all spans in S_{D_p}, to assign higher importance to smaller spans. To create our template, we preserve the top-p spans in S_{D_p}, not exceeding 20% of the entire document length, and mask all other spans in S_{D_p}. Each masked span is replaced by a single mask token. To introduce randomness into the process, we sample a probability γ from a Gaussian distribution N(μ, σ²) and randomly preserve a token in a contiguous span of tokens to be masked if γ crosses a set threshold α. After obtaining the template T for all documents in all corpora, we pre-train DALE on the denoising objective to reconstruct D_p from T.
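A minimal sketch of template creation via selective masking (Eq. 4). The span matching, the top-p budget (taken here as a fraction of the span count rather than of document length), and the randomness threshold are simplifying assumptions; `embed` is again an assumed encoder.

```python
import random
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def build_template(doc_tokens, spans, embed, top_p=0.2, alpha=0.4, mask="<mask>"):
    """Corrupt a document into a template by masking low-importance spans (Eq. 4)."""
    e_doc = embed(" ".join(doc_tokens))
    lengths = np.array([len(sp) for sp in spans], dtype=float)
    norm_len = lengths / lengths.sum()
    # Eq. 4: short spans that are similar to the document get high importance
    # and are preserved as hints; the remaining spans are masked.
    scores = [cosine(embed(" ".join(sp)), e_doc) / norm_len[t]
              for t, sp in enumerate(spans)]
    n_keep = max(1, int(top_p * len(spans)))
    keep = {sp for _, sp in sorted(zip(scores, map(tuple, spans)), reverse=True)[:n_keep]}
    out, i = [], 0
    while i < len(doc_tokens):
        hit = next((sp for sp in spans
                    if tuple(doc_tokens[i:i + len(sp)]) == tuple(sp)), None)
        if hit is None or tuple(hit) in keep:
            out.append(doc_tokens[i])
            i += 1
        else:
            out.append(mask)  # the whole span collapses to a single mask token
            if random.gauss(0.4, 0.6) > alpha:
                out.append(random.choice(hit))  # occasionally keep one token as a hint
            i += len(hit)
    return " ".join(out)
```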

3.2 DALE Fine-tuning

Though pre-trained DALE serves as an effective general-purpose data augmentation model for low-resource LLU tasks, we prefer to fine-tune DALE on our downstream dataset so that our generated augmentations exhibit an underlying data distribution similar to our gold dataset. This has been seen as critical to improving in-domain performance with scale (Geiping et al., 2023). However, extracting correlated spans with PMI from fine-tuning datasets with few samples is generally ineffective, as PMI becomes effective only with scale (Fano, 1961). Thus, to construct a template, we extract all n-grams N = {n_0, ..., n_t, ..., n_T} from a particular document (or training instance) D_f and assign an importance score to each by calculating the cosine similarity, similar to Eqn. 3, between E_pre(n_t) and (λ · E_pre(D_f) + (1 − λ) · E_pre(L_{D_f})), where L_{D_f} is the label of the document D_f. We elaborate in Appendix I.1 on how we construct L_{D_f} for tasks beyond multi-class classification. Finally, we preserve the top-p n-grams and mask everything else in the sentence, before merging consecutive masks. For datasets with documents exceeding 1024 tokens, we propose a sliding window mechanism for fine-tuning, sketched below. Specifically, with a window of size w tokens, we break a long sequence into constituent segments of 1024 tokens, with each segment beyond the initial one carrying additional non-masked context from the previous window. This context is additionally bounded between the special tokens <context> and </context> to provide the model with explicit supervision. We provide a detailed explanation in Appendix D of why the DALE fine-tuning masking algorithm is not well suited for pre-training and better fits the fine-tuning stage.
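A minimal sketch of the sliding-window segmentation just described, assuming whitespace tokens; the segment length and window size follow the text, while the list-of-token representation is an illustration choice.

```python
def sliding_segments(tokens, seg_len=1024, window=128):
    """Split a long token sequence into decoder-sized segments with carried context."""
    segments = []
    start = 0
    while start < len(tokens):
        body = tokens[start:start + seg_len]
        if start == 0:
            segments.append(body)
        else:
            # Each later segment is prefixed with non-masked context from the
            # previous window, bounded by explicit special tokens.
            context = tokens[start - window:start]
            segments.append(["<context>"] + context + ["</context>"] + body)
        start += seg_len
    return segments
```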

3.3 DALE Generation

To generate data augmentations using DALE, we construct a template by corrupting a sentence as in the fine-tuning stage and condition the model on it to generate augmentations. We use beam search with random multinomial sampling to generate diverse augmentations; a sketch follows this paragraph. For long documents, we employ the same sliding window mechanism and combine the outputs from all sliding-window segments into the final augmentation. After generating augmentations, we add them to the gold annotated data to fine-tune our downstream evaluation model.
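A minimal generation sketch using the Hugging Face `generate` API with beam search combined with multinomial sampling. The public BART-large checkpoint stands in for a pre-trained/fine-tuned DALE checkpoint, and the template string is illustrative.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

template = ("The Issuer shall fail to make any <mask> within 3 Business Days "
            "of when such <mask> is due and payable.")
inputs = tokenizer(template, return_tensors="pt")

# Beam search with multinomial sampling yields R diverse candidates per template.
outputs = model.generate(
    **inputs,
    num_beams=5,
    do_sample=True,          # multinomial sampling within beams
    num_return_sequences=5,  # R augmentation rounds
    max_length=1024,
)
augmentations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```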

4 Experiments and Results

4.1 Tasks and Datasets

Pre-training. To pre-train DALE, we use a combination of multiple datasets from Pile of Law (Henderson et al., 2022), CaseLaw (cas, 2018), and MAUD (Wang et al., 2023). The final pre-training corpus comprises ≈4.1M documents amounting to ≈48GB. Detailed statistics are in Appendix H.

Downstream Evaluation. To prove the efficacy of DALE, we conducted experiments on 13 legal datasets based on 6 tasks across 4 low-resource settings. These tasks include Multi-class Classification (MCC), Multi-label Classification (MLC), Named Entity Recognition (NER), Multiple-choice QA (MCQA; identifying the correct (masked) holding statement from a selection of choices), Rhetorical Role Prediction (RR; sequential text classification that assigns a label to each sentence in a legal document for semantic document segmentation), and Document-level NLI (DLI). For MCC, we experiment on the SCOTUS (Spaeth et al., 2013), LEDGAR (Tuggener et al., 2020), ILDC (Malik et al., 2021), and OTS-UL (Drawzeski et al., 2021) datasets.

Method | OTS-TOPICS | EUR-LEX | ECtHR-A | ECtHR-B | UNFAIR-ToS (each cell: F1 at 100/200/500/1000 gold samples)
Gold-only | 0.10/11.47/51.16/53.87 | 8.68/4.30/10.32/42.26 | 25.26/27.30/17.14/31.52 | 37.69/47.47/44.89/50.98 | 0.10/33.88/70.02/76.21
EDA | 9.72/38.43/37.56/46.99 | 12.11/22.93/49.26/51.54 | 10.10/35.64/41.91/49.67 | 43.01/48.70/56.32/59.40 | 13.93/26.31/72.15/78.14
Legal-EDA | 10.10/39.15/40.40/50.48 | 12.45/23.61/51.24/53.27 | 12.24/36.75/43.89/52.93 | 43.86/54.72/57.71/61.53 | 15.86/27.54/72.98/78.69
SSMBA | 10.41/15.28/47.31/52.63 | 4.10/21.32/45.67/48.70 | 7.55/18.10/34.39/37.58 | 35.32/45.43/48.08/52.65 | 6.53/18.21/63.96/68.59
AEDA | 14.06/52.63/60.29/72.32 | 3.07/33.33/50.33/52.21 | 28.12/30.94/32.29/45.48 | 39.15/50.85/50.48/51.26 | 8.08/52.34/70.48/73.67
SMERTI | 3.41/17.90/57.26/60.54 | 6.62/27.86/44.45/47.68 | 28.51/22.61/23.43/38.59 | 38.43/51.02/52.07/53.71 | 20.46/47.31/59.38/69.27
BackTrans | 8.26/37.44/47.47/50.85 | 5.03/19.63/37.86/42.65 | 14.73/17.37/35.36/39.41 | 37.61/49.88/50.77/52.83 | 12.84/39.28/46.51/62.64
C-MLM | 3.85/17.95/58.54/61.45 | 7.17/28.21/45.04/47.85 | 27.95/23.24/23.89/39.23 | 39.46/52.17/53.26/54.68 | 20.42/48.52/59.87/69.62
GENIUS | 25.58/54.31/63.71/67.29 | 5.79/34.03/53.19/57.95 | 28.68/28.66/36.38/43.67 | 40.40/44.03/50.54/54.29 | 11.20/47.18/67.71/75.79
ChatGPT | 23.42/53.31/62.17/65.87 | 5.52/33.22/52.21/56.45 | 27.52/27.89/34.03/41.83 | 39.61/43.12/49.76/53.87 | 10.78/44.62/65.87/72.91
Falcon | 12.36/37.84/48.66/51.74 | 5.11/22.02/46.19/49.03 | 17.68/20.39/35.81/38.62 | 36.12/46.53/47.27/53.85 | 5.44/16.10/62.82/67.51
DALE-BART | 25.77/54.01/58.29/68.04 | 12.32/34.39/53.65/56.27 | 23.01/35.68/40.13/52.47 | 43.91/52.76/54.58/60.24 | 18.43/46.60/68.21/75.04
DALE-pt | 24.58/52.17/58.18/69.97 | 11.50/29.51/51.63/53.12 | 24.19/33.87/40.87/48.85 | 42.97/51.67/51.63/59.23 | 18.54/47.59/63.21/73.56
DALE-ft | 24.63/53.22/59.64/70.15 | 11.61/33.54/52.38/57.62 | 24.21/34.76/41.78/51.65 | 43.33/53.74/55.12/60.95 | 19.11/48.71/67.42/74.86
DALE (ours) | 33.91/61.23/71.56/73.24 | 13.50/37.93/55.99/59.45 | 29.43/37.57/44.38/55.72 | 46.72/56.13/59.18/64.01 | 22.32/54.62/74.84/82.98

Table 2: Results for Multi-label classification. DALE outperforms baselines by 1%-49.8%.

Method | LEDGAR | ILDC | SCOTUS | OTS (each cell: F1 at 100/200/500/1000 gold samples)
Gold-only | 22.65/61.39/71.43/75.13 | 51.48/54.24/55.83/58.03 | 63.69/65.93/70.75/75.92 | 66.72/68.59/70.21/72.54
EDA | 42.65/59.31/72.34/75.76 | 49.76/49.83/59.32/61.72 | 53.00/61.57/72.51/73.29 | 68.93/69.66/72.13/73.28
Legal-EDA | 53.00/60.57/73.28/76.72 | 52.15/52.23/60.38/62.27 | 55.21/61.39/73.69/75.57 | 69.51/71.67/76.31/79.72
SSMBA | 47.86/60.34/70.06/74.21 | 47.62/50.21/58.53/60.12 | 43.00/60.57/72.51/76.26 | 60.12/70.17/75.47/76.04
AEDA | 46.99/58.06/71.01/75.35 | 48.93/49.62/56.36/59.05 | 62.15/62.65/71.24/73.55 | 61.29/67.08/74.26/81.26
SMERTI | 33.23/60.65/62.24/67.25 | 42.34/44.82/51.27/58.73 | 63.78/66.71/70.92/71.57 | 66.99/68.72/76.58/80.58
BackTrans | 51.23/58.96/63.84/69.04 | 40.72/41.33/59.18/62.01 | 42.01/45.63/57.22/67.56 | 59.69/65.81/66.23/71.53
C-MLM | 34.12/60.95/63.11/68.15 | 43.18/45.65/52.01/58.98 | 61.56/65.54/71.25/71.95 | 67.05/68.97/77.52/79.62
GENIUS | 48.76/62.14/71.17/74.48 | 51.35/54.26/53.39/52.14 | 59.42/61.71/63.14/70.28 | 66.71/68.65/76.20/79.73
GPT3-Mix | 30.37/58.74/61.62/66.44 | 41.87/43.73/50.45/57.52 | 63.42/65.82/70.87/71.03 | 66.73/67.53/77.07/79.21
PromDA | 45.76/51.24/65.40/68.27 | 41.30/43.08/49.21/51.27 | 44.59/53.86/59.72/61.58 | 63.72/65.73/70.38/73.28
ChatGPT | 46.87/61.18/70.41/73.92 | 50.74/52.93/52.34/51.21 | 58.69/60.56/62.81/69.40 | 65.01/67.88/75.32/78.19
Falcon | 43.07/58.32/68.48/73.62 | 46.29/48.27/57.83/58.03 | 42.11/59.83/60.32/70.54 | 59.19/66.25/73.17/75.08
DALE-BART | 50.95/57.90/64.28/70.87 | 52.26/51.54/54.31/62.68 | 60.01/65.27/62.02/72.13 | 69.12/70.89/71.99/77.97
DALE-pt | 48.26/55.39/65.27/67.94 | 52.02/51.87/57.26/58.51 | 59.61/63.25/66.72/68.85 | 69.93/70.21/73.68/75.89
DALE-ft | 52.01/58.67/68.38/72.24 | 52.14/53.88/58.15/61.92 | 59.70/64.62/65.46/72.41 | 68.85/70.91/74.31/77.58
DALE (ours) | 55.13/63.76/74.89/78.36 | 54.47/55.95/62.45/63.11 | 65.85/67.86/74.89/78.96 | 71.64/72.89/77.74/83.75

Table 3: Results for Multi-class classification. DALE outperforms baselines by 1%-49.8%.

For MLC, we experiment on the ECtHR Task A and B (Chalkidis et al., 2019, 2021b), EUR-LEX (Chalkidis et al., 2021a), UNFAIR-ToS (Lippi et al., 2019), and OTS-CT (Drawzeski et al., 2021) datasets. For NER, we experiment on the EDGAR (Au et al., 2022) and Indian-Legal-NER (Kalamkar et al., 2022) datasets. For MCQA, we experiment on CaseHOLD (Zheng et al., 2021). For RR, we experiment on the BUILD dataset (Malik et al., 2022). Finally, for DLI, we experiment on ContractNLI (Koreeda and Manning, 2021). We perform class-balanced sampling to create the low-resource splits and downsample the dev set accordingly. Dataset statistics are in Appendix H. We report micro-averaged F1 scores averaged across 3 runs with 3 random seeds.

4.2 Experimental Setup

DALE. As mentioned earlier, we use BART-large (Lewis et al., 2019) as our encoder-decoder model for training DALE. We detail in Appendix E why we think BART-large is the most suitable model for our task and setup. We pre-train DALE for 5 epochs using the Adam optimizer with a learning rate of 1e-5 and a batch size of 32. We use the same setting for fine-tuning DALE, but with a learning rate of 1e-3.

Downstream Task-Specific Setups. For downstream task-specific evaluation, we fine-tune legal-longformer-large (Chalkidis* et al., 2023) for 20 epochs with a batch size of 16, using the Adam optimizer with a learning rate of 1e-5. Details about the hyper-parameter setup for our experiments, including hyper-parameter tuning experiments, can be found in Appendix B.
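As a rough illustration, the downstream setup above maps onto the Hugging Face Trainer API as follows; the checkpoint name and num_labels are assumptions for illustration, not the authors' released training code.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed checkpoint name for legal-longformer-large; num_labels is task-specific.
model = AutoModelForSequenceClassification.from_pretrained(
    "lexlms/legal-longformer-large", num_labels=10)
tokenizer = AutoTokenizer.from_pretrained("lexlms/legal-longformer-large")

# Hyperparameters as stated in the text: 20 epochs, batch size 16, lr 1e-5.
args = TrainingArguments(
    output_dir="out",
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=1e-5,
)
# trainer = Trainer(model=model, args=args, train_dataset=gold_plus_augmented)
# trainer.train()
```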

Method | CaseHOLD (100/200/500/1000) | BUILD-RR (100/200) | ContractNLI (100/200)
Gold-only | 33.92/66.38/70.06/70.80 | 74.62/78.24 | 72.03/82.06
EDA | 56.38/64.71/66.42/69.45 | 77.33/81.83 | 73.92/75.40
AEDA | 57.96/65.10/69.12/70.05 | 77.95/82.01 | 77.24/83.02
SSMBA | 62.01/67.65/69.59/69.75 | 77.77/81.66 | 76.27/82.93
SMERTI | 56.52/64.13/69.15/69.85 | 77.42/80.65 | 76.23/81.95
BackTrans | 55.69/65.72/69.29/69.74 | 77.59/81.08 | 75.98/81.19
GENIUS | 55.84/61.37/64.17/68.20 | 78.99/79.30 | 77.28/81.28
ChatGPT | 54.67/60.83/62.57/67.59 | 77.32/78.37 | 76.29/80.10
Falcon | 52.57/58.76/62.41/63.22 | 75.11/77.61 | 75.17/77.54
DALE-BART | 61.21/66.09/67.91/70.64 | 78.59/80.01 | 76.56/81.27
DALE-pt | 59.25/65.69/67.81/69.70 | 78.15/79.01 | 76.97/80.55
DALE-ft | 60.31/66.56/68.46/70.15 | 78.50/79.72 | 77.10/81.73
DALE (ours) | 63.71/68.14/71.53/72.70 | 81.83/83.04 | 79.26/85.13

Table 4: Results for MCQA (CaseHOLD), RR (BUILD-RR), and DLI (ContractNLI). DALE outperforms by 0.5%-29.8%.

4.3 Baselines

Details on the workings of each baseline can be found in Appendix F.

Gold-only Baseline. This baseline is common across tasks and uses only gold data without any additional augmentations.

Classification Baselines. For MLC, we compare DALE against EDA (Wei and Zou, 2019), Legal-EDA (Perçin et al., 2022), GENIUS(-ft) (Guo et al., 2022a), SSMBA (Ng et al., 2020b), AEDA (Karimi et al., 2021), SMERTI (Feng et al., 2019), BackTrans (Yu et al., 2018), C-MLM (Kumar et al., 2020), ChatGPT (Dai et al., 2023), and instruction-tuned Falcon (Penedo et al., 2023). For MCC, we add to this list GPT3-Mix (Yoo et al., 2021) and PromDA (Wang et al., 2022). Since GENIUS and C-MLM involve pre-training, we pre-trained them on our data with their respective masking algorithms.

Other Task Baselines. For NER, we compare DALE against LwTR (Dai and Adel, 2020), DAGA (Ding et al., 2020), MulDA (Liu et al., 2021), MELM (Zhou et al., 2022b), PromDA, ChatGPT, and instruction-tuned Falcon. For RR, DLI, and MCQA, we compare DALE against EDA, GENIUS, SSMBA, AEDA, and BackTrans.

DALE Ablations. To evaluate the effectiveness of the core steps in the DALE augmentation framework, we also compare DALE against two ablations: DALE-pt (augmentations generated with only a pre-trained DALE, without any fine-tuning) and DALE-ft (augmentations generated with only a fine-tuned Legal-BART, without DALE pre-training). DALE-BART is DALE pre-trained on Pile of Law with random masking. We provide additional results in Appendix B.

4.4 Results

Quantitative Comparison. Tables 2 and 3 compare the performance of DALE with our baselines on MLC and MCC, respectively. DALE outperforms baselines with absolute improvements in the range of 1%-32.5% for MLC and 1%-49.8% for MCC. Table 5 compares the performance of DALE with our baselines on NER, where DALE outperforms them with absolute improvements in the range of 1%-39.6%.

Method | EDGAR (100/200/500/1000) | Indian-Legal-NER (100/200/500/1000)
Gold-only | 0.75/0.27/34.86/57.84 | 8.41/13.61/33.28/42.6
LwTR | 22.10/36.84/50.33/54.15 | 12.53/17.87/35.54/44.15
DAGA | 13.21/24.54/36.15/42.58 | 5.13/14.52/26.13/31.74
MulDA | 8.17/21.33/42.61/50.16 | 13.75/19.28/31.96/40.69
MR | 19.13/36.62/50.95/58.33 | 18.62/25.26/43.14/49.68
MELM | 12.32/24.35/48.72/60.59 | 14.55/21.69/38.73/48.64
GENIUS | 13.79/28.44/50.93/62.69 | 19.05/29.28/48.72/53.61
PromDA | 10.10/27.31/45.77/55.62 | 16.46/26.91/45.34/44.62
ChatGPT | 12.65/26.32/49.25/60.67 | 18.24/27.58/46.44/51.41
Falcon | 11.24/25.71/48.69/59.84 | 18.11/26.23/43.05/49.38
DALE-BART | 17.76/34.20/48.71/57.99 | 16.43/29.19/46.03/49.96
DALE-pt | 18.38/33.12/47.67/53.67 | 17.25/27.86/45.57/48.28
DALE-ft | 19.10/35.39/48.20/58.74 | 17.65/28.32/46.71/49.98
DALE (ours) | 23.65/39.82/55.99/64.32 | 21.31/32.47/49.93/54.27

Table 5: Results for NER. DALE outperforms by 1%-39.6%.

Method | Perplexity(↓) / Diversity(↑) / Diversity-L(↑) at 200 | Perplexity(↓) / Diversity(↑) / Diversity-L(↑) at 500
EDA | 82.22 / 12.49 / 83.48 | 86.14 / 12.72 / 86.28
Legal-EDA | 55.38 / 25.71 / 13.51 | 58.92 / 26.70 / 14.26
SSMBA | 37.96 / 54.74 / 17.74 | 37.84 / 56.85 / 19.29
AEDA | 26.93 / 2.17 / 176.68 | 27.05 / 13.67 / 145.13
SMERTI | 28.56 / 56.84 / 13.76 | 29.20 / 59.62 / 14.58
BackTrans | 27.94 / 45.05 / 27.62 | 27.85 / 49.05 / 28.62
C-MLM | 50.39 / 41.04 / 23.85 | 51.69 / 44.86 / 25.69
GENIUS | 24.37 / 106.08 / 226.65 | 24.65 / 105.04 / 278.64
GPT3-Mix | 52.76 / 42.21 / 29.74 | 53.21 / 45.73 / 33.68
PromDA | 174.67 / 65.69 / 15.74 | 187.68 / 73.93 / 16.84
LwTR | 481.34 / 86.91 / 49.87 | 413.66 / 76.37 / 21.42
MR | 82.72 / 75.65 / 29.23 | 79.65 / 81.46 / 32.76
MELM | 211.94 / 12.49 / 83.48 | 183.23 / 12.72 / 86.28
ChatGPT | 26.29 / 64.31 / 32.85 | 26.17 / 66.94 / 35.85
Falcon | 45.24 / 13.64 / 17.63 | 44.97 / 15.74 / 18.59
DALE-BART | 20.36 / 172.54 / 222.37 | 21.65 / 193.32 / 231.86
DALE-pt | 58.09 / 66.99 / 260.00 | 60.12 / 59.84 / 294.05
DALE-ft | 18.75 / 149.77 / 219.22 | 20.21 / 156.54 / 200.99
DALE (ours) | 18.63 / 175.38 / 227.39 | 18.44 / 194.20 / 234.86

Table 6: Quantitative evaluation of generation quality on the measures of perplexity, token diversity (Diversity), and length diversity (Diversity-L). DALE outperforms all our baselines.

Table 4 compares the performance of DALE with our baselines on MCQA, RR, and DLI. DALE outperforms baselines with absolute improvements in the range of 0.5%-29.8% in MCQA, 1%-7.2% in RR, and 2%-9.7% in DLI. DALE-BART performs similarly to DALE-ft and is inferior to DALE, thereby showing the ineffectiveness of random masking for the legal domain.

Qualitative Comparison. Table 6 compares the generation quality of DALE with all our baselines (averaged baseline-wise across all tasks and splits) on the measures of perplexity (Jelinek et al., 1977), diversity (the average number of new tokens introduced across R augmentations), and length diversity (the average absolute difference in length between the source and the R augmentations). DALE outperforms most of our baselines in all settings. DALE-pt generates more diverse augmentations, but at the cost of not maintaining the underlying data distribution. Beyond Table 1, Table 18 provides more augmentation examples. Contrary to our baselines, which are either too conservative or too aggressive, DALE, especially for long documents, generates augmentations that are diverse, coherent, and consistent with the source label.
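Under the definitions above, the two diversity measures can be computed with a simple sketch like the following; whitespace tokenization is an assumption, as the paper does not specify its tokenizer.

```python
def token_diversity(source, augmentations):
    """Average number of new tokens each augmentation introduces over the source."""
    src = set(source.split())
    return sum(len(set(a.split()) - src) for a in augmentations) / len(augmentations)

def length_diversity(source, augmentations):
    """Average absolute length difference between source and augmentations."""
    n = len(source.split())
    return sum(abs(len(a.split()) - n) for a in augmentations) / len(augmentations)
```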
Table 7: Comparison of augmentations generated by DALE and all other baselines for the UNFAIR-ToS dataset. All augmentations were generated in a low-resource setting (500). Each augmentation was marked by a law student on 3 parameters: (1) whether the augmentation is coherent, (2) whether it adds new plausible context, and (3) whether it is label-consistent and matches the underlying data distribution. The results are shown as ✓ or ✗ next to each augmentation, in that order. More examples can be found in Table 18.

Original: The most recent version of this agreement will be posted on the services under settings and also on gotinder.com, and you should regularly check for the most recent version.
EDA (✗ ✗ ✓): recent version of this agreement will be posted on the services under settings and also on gotinder com and you should regularly check for the most recent version
AEDA (✗ ✗ ✓): the most ; recent version of ; this agreement will be posted on the , services under settings and also on gotinder.com . , and you should regularly check for the most recent version . ,
SMERTI (✗ ✗ ✗): This most recent version of Windows will be posted on power under settings available on gotinder. , and you should regularly check our most recent version.
GENIUS (✓ ✗ ✗): The terms of this agreement will be contingent on the services they provide. For more information, please visit www.sos.gov.
ChatGPT (✓ ✗ ✓): The latest edition of this agreement will be made available on the services, specifically under the settings section and on gotinder.com. It is advisable to frequently review the most recent version.
Falcon (✓ ✗ ✓): The most recent version of this agreement will be posted on the services under settings and also on gotinder.com, and you should regularly check for the most recent version.
DALE-pt (✓ ✗ ✗): The most recent version of this agreement shall be accepted as the most recent amendment.
DALE-ft (✓ ✗ ✓): the most recent version of this agreement will be posted on the services under settings and also on gotinder.com, and you should regularly check for the most most recent versions.
DALE (✓ ✓ ✓): The most recent version of this agreement will be posted on the services's website at https://www.adr.nianticlabs.com/ where you can download and view the services, and you should be aware that this is not a guarantee that the services will be up to code or up to date, and we reserve the right to discontinue using the services at any time.

5 Conclusion

This paper presents DALE, a novel generative data augmentation framework for low-resource legal NLP. We evaluate DALE on 13 datasets spanning 6 tasks under 4 low-resource settings and show that DALE outperforms all prior art quantitatively and qualitatively by a significant margin.

Acknowledgement

This work was supported by ARO grants W911NF2310352 and W911NF2110026.

Limitations and Future Work

In this section, we list some potential limitations of DALE:

1. DALE is still restricted to generating augmentations for legal datasets that consist of documents only in English. Though English is prevalent in the legal literature across domains and genres, recent work shows the importance of multi-lingual legal language modeling (Niklaus et al., 2023). As part of future work, we would like to overcome this shortcoming by introducing multi-lingual DALE.

2. In extreme low-resource scenarios, DALE with optional fine-tuning might be prone to over-fitting, generating near-identical augmentations. Though using pre-trained DALE can overcome this problem, our experiments clearly show the benefits of fine-tuning. Thus, as part of future work, we would like to explore combining the augmentations generated by pre-trained and fine-tuned DALE.

3. Our masking algorithm involves PMI, which is beneficial only at scale. Though benefiting from scale is an inherent property of pre-training, we would like to explore possible ways to overcome this problem.

Ethics Statement

We acknowledge that augmentations generated by DALE might not always be factual, i.e., describe events that have actually occurred in the real world. However, DALE is not meant to directly assist a legal practitioner in their everyday practice through its generations. Instead, DALE is meant only for generating augmentations to help train downstream models that can, in turn, help legal practitioners in their practice.

A Algorithm

We present DALE algorithmically in Algorithm 1.

Algorithm 1 DALE: Our proposed augmentation framework

    Given pre-training dataset C, Enc-Dec PLM L, and Enc-only PLM P
    C_masked ← ∅
    N_C ← extract all n-grams from C
    S_C ← extract all correlated spans from N_C
    S_C ← select only the top j% of spans
    for D_raw ∈ C do                                        ▷ Masking loop
        D_p ← OptimalContextSelection(D_raw)
        S_Dp ← spans of S_C present in D_p
        Rank all spans in S_Dp
        T ← keep the top-p spans of D_p, mask the rest      ▷ Selective masking
    end for
    Pre-train L with denoising to reconstruct D_p from T

    Given low-resource fine-tuning dataset D_train and DALE ▷ Optional FT
    for {X, Y} ∈ D_train do
        T ← SelectiveMasking(X)
    end for
    Fine-tune L with denoising to reconstruct X from T

    for {X, Y} ∈ D_train do                                 ▷ Generation loop
        repeat R times:
            T ← SelectiveMasking(X)
            X_aug ← GENAUG(DALE(T))                         ▷ Generate augmented data
            D_aug ← D_aug ∪ {X_aug}
    end for
    Fine-tune P with D_aug
    return P

B Hyperparameter Tuning

Hyperparameters. We set q to 7 for n-gram extraction. Values of c and p_c are provided in Appendix B.1. We choose legal-longformer-large as E_pre(.). For PMI selection, we set j to 50%. For optimal context selection, we set μ, σ², and β to 0.5, 0.7, and 0.3, respectively. For selective masking, we set μ, σ², and α to 0.4, 0.6, and 0.4, respectively. We set λ to 0.7 for optimal context selection and to 0.5 for downstream DALE fine-tuning. We set the number of augmentation rounds R to 5. All hyperparameters were tuned on the dev set. We also show tuning results for some important hyperparameters in the following subsections.

B.1 Discounting Factor c

Table 15 details the discounting factor c corresponding to the percentile p_c for each corpus used in DALE pre-training. A corpus with entity-rich documents has a higher discounting factor (Caselaw) compared to a corpus with more natural-language sentences and thus fewer entities (r/legaladvice). Table 15 also provides examples of correlated spans extracted through PMI calculation before and after discounting. Clearly, the discounting factor plays a major role in extracting spans that are reusable text fragments with fewer entities.

B.2 Augmentation Rounds R

Table 8 compares the performance of DALE at different values of R. Augmenting the training dataset over several augmentation rounds R proves effective until a saturation point is reached. Downstream LLU performance improves as more DALE augmentations are added to the gold data, similar to the findings of Geiping et al. (2023).

R  | 1     | 2     | 3     | 4     | 5     | 6     | 7
F1 | 53.67 | 54.58 | 55.02 | 58.94 | 59.35 | 59.31 | 59.09

Table 8: F1 for various settings of R. All values are averaged across all datasets and all low-resource settings.

B.3 DALE without Optimal Context Selection

Table 9 compares the performance of DALE with and without optimal context selection. We show that optimal context selection plays a significant role in improving the performance of DALE.

w/ Optimal Context: 59.35 | w/o Optimal Context: 57.46

Table 9: F1 with and without optimal context selection. All values are averaged across all datasets and all low-resource settings.

References

2018. Caselaw access project. Online. Accessed on April 25, 2023.

Azad Abad and Alessandro Moschitti. 2016. Taking the best from the crowd: learning question passage classification from noisy data. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 136–141, Berlin, Germany. Association for Computational Linguistics.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.

Ting Wai Terence Au, Vasileios Lampos, and Ingemar Cox. 2022. E-NER: an annotated named entity recognition corpus of legal text. In Proceedings of the Natural Legal Language Processing Workshop 2022, pages 246–255, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Andrew Blair-Stanek, Nils Holzenberger, and Benjamin Van Durme. 2020. Tax Law NLP Resources.

Łukasz Borchmann, Dawid Wisniewski, Andrzej Gretkowski, Izabela Kosmala, Dawid Jurkiewicz, Łukasz Szałkiewicz, Gabriela Pałka, Karol Kaczmarek, Agnieszka Kaliska, and Filip Graliński. 2020. Contract discovery: Dataset and a few-shot semantic retrieval challenge with competitive baselines. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4254–4268, Online. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Ilias Chalkidis. 2023. ChatGPT may pass the bar exam soon, but has a long way to go for the LexGLUE benchmark. arXiv preprint arXiv:2304.12202.

Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural legal judgment prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4317–4323, Florence, Italy. Association for Computational Linguistics.

Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021a. MultiEURLEX: a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. arXiv preprint arXiv:2109.00904.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The muppets straight out of law school. arXiv preprint arXiv:2010.02559.

Ilias Chalkidis, Manos Fergadiotis, Dimitrios Tsarapatsanis, Nikolaos Aletras, Ion Androutsopoulos, and Prodromos Malakasiotis. 2021b. Paragraph-level rationale extraction through regularization: A case study on European court of human rights cases. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 226–241, Online. Association for Computational Linguistics.

Ilias Chalkidis*, Nicolas Garneau*, Catalina Goanta, Daniel Martin Katz, and Anders Søgaard. 2023. LeXFiles and LegalLAMA: Facilitating English multinational legal language model development. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada. Association for Computational Linguistics.

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. 2021c. LexGLUE: A benchmark dataset for legal language understanding in English. arXiv preprint arXiv:2110.00976.

Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. 2023. An empirical survey of data augmentation for limited data learning in NLP. Transactions of the Association for Computational Linguistics, 11:191–211.

Shuguang Chen, Leonardo Neves, and Thamar Solorio. 2022. Style transfer as data augmentation: A case study on named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1827–1841, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Haixing Dai, Zheng Liu, Wenxiong Liao, Xiaoke Huang, Zihao Wu, Lin Zhao, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li. 2023. ChatAug: Leveraging ChatGPT for text data augmentation. ArXiv, abs/2302.13007.

Xiang Dai and Heike Adel. 2020. An analysis of simple data augmentation for named entity recognition. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3861–3867, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Bosheng Ding, Linlin Liu, Lidong Bing, Canasai Kruengkrai, Thien Hai Nguyen, Shafiq Joty, Luo Si, and Chunyan Miao. 2020. DAGA: Data augmentation with a generation approach for low-resource tagging tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6045–6057, Online. Association for Computational Linguistics.

Kasper Drawzeski, Andrea Galassi, Agnieszka Jablonowska, Francesca Lagioia, Marco Lippi, Hans Wolfgang Micklitz, Giovanni Sartor, Giacomo Tagiuri, and Paolo Torroni. 2021. A corpus for multilingual analysis of online terms of service. In Proceedings of the Natural Legal Language Processing Workshop 2021, pages 1–8, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Robert M Fano. 1961. Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793–794.

Steven Y. Feng, Aaron W. Li, and Jesse Hoey. 2019. Keep calm and switch on! Preserving sentiment and fluency in semantic text exchange. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2701–2711, Hong Kong, China. Association for Computational Linguistics.

Zihao Fu, Wai Lam, Qian Yu, Anthony Man-Cho So, Shengding Hu, Zhiyuan Liu, and Nigel Collier. 2023. Decoder-only or encoder-decoder? Interpreting language model as a regularized encoder-decoder. arXiv preprint arXiv:2304.04052.

Jonas Geiping, Micah Goldblum, Gowthami Somepalli, Ravid Shwartz-Ziv, Tom Goldstein, and Andrew Gordon Wilson. 2023. How much data are augmentations worth? An investigation into scaling laws, invariance, and implicit regularization. In The Eleventh International Conference on Learning Representations.

Sreyan Ghosh, Utkarsh Tyagi, Sonal Kumar, and Dinesh Manocha. 2023. BioAug: Conditional generation based data augmentation for low-resource biomedical NER. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.

Ondřej Glogar. 2023. The concept of legal language: What makes legal language 'legal'? International Journal for the Semiotics of Law / Revue internationale de Sémiotique juridique, pages 1–27.

Biyang Guo, Yeyun Gong, Yelong Shen, Songqiao Han, Hailiang Huang, Nan Duan, and Weizhu Chen. 2022a. GENIUS: Sketch-based language model pre-training via extreme and selective masking for text generation and augmentation. arXiv preprint arXiv:2211.10330.

Hongyu Guo. 2020. Nonlinear mixup: Out-of-manifold data augmentation for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4044–4051.

Hongyu Guo, Yongyi Mao, and Richong Zhang. 2019. Augmenting data with mixup for sentence classification: An empirical study. arXiv preprint arXiv:1905.08941.

Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. 2022b. LongT5: Efficient text-to-text transformer for long sequences. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736, Seattle, United States. Association for Computational Linguistics.

Rupert Haigh. 2023. International Legal English: A Practical Introduction for Students and Professionals. Routledge.

Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel Ho. 2022. Pile of Law: Learning responsible data filtering from the law and a 256GB open-source legal dataset. Advances in Neural Information Processing Systems, 35:29217–29234.

Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. CUAD: An expert-annotated NLP dataset for legal contract review. arXiv preprint arXiv:2103.06268.

Zihan Huang, Charles Low, Mengqiu Teng, Hongyi Zhang, Daniel E. Ho, Mark S. Krass, and Matthias Grabmair. 2021. Context-aware legal citation recommendation using deep learning. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, ICAIL '21, pages 79–88, New York, NY, USA. Association for Computing Machinery.

HuggingFace. 2023. HuggingFaceH4/open_llm_leaderboard.

Maor Ivgi, Uri Shaham, and Jonathan Berant. 2023. Efficient long-text understanding with short-text models. Transactions of the Association for Computational Linguistics, 11:284–299.

Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. 1977. Perplexity: a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. ACM Computing Surveys.

Prathamesh Kalamkar, Astha Agarwal, Aman Tiwari, Smita Gupta, Saurabh Karn, and Vivek Raghavan. 2022. Named entity recognition in Indian court judgments. In Proceedings of the Natural Legal Language Processing Workshop 2022, pages 184–193, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Tian Kang, Adler Perotte, Youlan Tang, Casey Ta, and Chunhua Weng. 2020. UMLS-based data augmentation for natural language processing of clinical research literature. Journal of the American Medical Informatics Association, 28(4):812–823.

Akbar Karimi, Leonardo Rossi, and Andrea Prati. 2021. AEDA: An easier data augmentation technique for text classification. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2748–2754, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Daniel Martin Katz, Dirk Hartung, Lauritz Gerlach, Abhik Jana, and Michael J Bommarito II. 2023. Natural language processing in the legal domain. arXiv preprint arXiv:2302.12039.

Hazel H Kim, Daecheol Woo, Seong Joon Oh, Jeong-Won Cha, and Yo-Sub Han. 2022. ALP: Data augmentation using lexicalized PCFGs for few-shot text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10894–10902.

Philipp Koehn et al. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86. Citeseer.

Yuta Koreeda and Christopher Manning. 2021. ContractNLI: A dataset for document-level natural language inference for contracts. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1907–1919, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. In AACL 2020 Workshop on Lifelong Learning for Spoken Language Systems.

Elena Leitner, Georg Rehm, and Julian Moreno-Schneider. 2019. Fine-grained named entity recognition in legal documents. In Semantic Systems. The Power of AI and Knowledge Graphs: 15th International Conference, SEMANTiCS 2019, Karlsruhe, Germany, September 9–12, 2019, Proceedings, pages 272–287. Springer.

Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, and Yoav Shoham. 2021. PMI-masking: Principled masking of correlated spans. In International Conference on Learning Representations.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, et al. 2021. MST: Masked self-supervised transformer for visual representation. Advances in Neural Information Processing Systems, 34:13165–13176.

Marco Lippi, Przemyslaw Palka, Giuseppe Contissa, Francesca Lagioia, Hans-Wolfgang Micklitz, Giovanni Sartor, and Paolo Torroni. 2018. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. CoRR, abs/1805.01217.

Marco Lippi, Przemysław Pałka, Giuseppe Contissa, Francesca Lagioia, Hans-Wolfgang Micklitz, Giovanni Sartor, and Paolo Torroni. 2019. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law, 27:117–139.

Linlin Liu, Bosheng Ding, Lidong Bing, Shafiq Joty, Luo Si, and Chunyan Miao. 2021. MulDA: A multilingual data augmentation framework for low-resource cross-lingual NER. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5834–5846.

Shayne Longpre, Yu Wang, and Chris DuBois. 2020. How effective is task-agnostic data augmentation for pretrained transformers? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4401–4411, Online. Association for Computational Linguistics.

Vijit Malik, Rishabh Sanjay, Shouvik Kumar Guha, Angshuman Hazarika, Shubham Nigam, Arnab Bhattacharya, and Ashutosh Modi. 2022. Semantic segmentation of legal documents via rhetorical roles.

Vijit Malik, Rishabh Sanjay, Shubham Kumar Nigam, Kripabandhu Ghosh, Shouvik Kumar Guha, Arnab Bhattacharya, and Ashutosh Modi. 2021. ILDC for CJPE: Indian legal documents corpus for court judgment prediction and explanation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4046–4062, Online. Association for Computational Linguistics.

Dimitris Mamakas, Petros Tsotsi, Ion Androutsopoulos, and Ilias Chalkidis. 2022. Processing long legal documents with pre-trained transformers: Modding LegalBERT and Longformer. In Proceedings of the Natural Legal Language Processing Workshop 2022, pages 130–142, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Gabriele Marino, Daniele Licari, Praveen Bushipaka, Giovanni Comandé, and Tommaso Cucinotta. 2023. Automatic rhetorical roles classification for legal documents using Legal-TransformerOverBERT.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Mary Jane Morrison. 1989. Excursions into the nature of legal language. Clev. St. L. Rev., 37:271.

Inderjeet Nair and Natwar Modani. 2023. Exploiting language characteristics for legal domain-specific language model pretraining. In Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia. Association for Computational Linguistics.

Nathan Ng, Kyunghyun Cho, and Marzyeh Ghassemi. 2020a. SSMBA: Self-supervised manifold based data augmentation for improving out-of-domain robustness. arXiv preprint arXiv:2009.10195.

Nathan Ng, Kyunghyun Cho, and Marzyeh Ghassemi. 2020b. SSMBA: Self-supervised manifold based data augmentation for improving out-of-domain robustness. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1268–1283, Online. Association for Computational Linguistics.

An Thanh Nguyen, Byron Wallace, Junyi Jessy Li, Ani Nenkova, and Matthew Lease. 2017. Aggregating and predicting sequence labels from crowd annotations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 299–309, Vancouver, Canada. Association for Computational Linguistics.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Patrick Pantel and Dekang Lin. 2002. Discovering word senses from text. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 613–619.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

Sezen Perçin, Andrea Galassi, Francesca Lagioia, Federico Ruggeri, Piera Santin, Giovanni Sartor, and Paolo Torroni. 2022. Combining wordnet and word embeddings in data augmentation for legal texts. In Proceedings of the Natural Legal Language Processing Workshop 2022, pages 47–52.

Chen Qian, Fuli Feng, Lijie Wen, Zhenpeng Chen, Li Lin, Yanan Zheng, and Tat-Seng Chua. 2020. Solving sequential text classification as board-game playing. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8640–8648.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).

Carlos Ramisch, Vitor De Araujo, and Aline Villavicencio. 2012. A broad evaluation of techniques for automatic acquisition of multiword expressions. In Proceedings of ACL 2012 Student Research Workshop, pages 1–6.

Federico Ruggeri, Francesca Lagioia, Marco Lippi, and Paolo Torroni. 2021. Detecting and explaining unfairness in consumer contracts through memory networks. Artificial Intelligence and Law, pages 1–34.

Nafis Sadeq, Canwen Xu, and Julian McAuley. 2022. InforMask: Unsupervised informative masking for language model pretraining. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5866–5878, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Ramit Sawhney, Megh Thakkar, Shivam Agarwal, Di Jin, Diyi Yang, and Lucie Flek. 2021. HypMix: Hyperbolic interpolative data augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9858–9868.

Abhay Shukla, Paheli Bhattacharya, Soham Poddar, Rajdeep Mukherjee, Kripabandhu Ghosh, Pawan Goyal, and Saptarshi Ghosh. 2022. Legal case document summarization: Extractive and abstractive methods and their evaluation. arXiv preprint arXiv:2210.07544.

Ann Sinsheimer. 2007. Christopher Williams, Tradition and change in legal English: Verbal constructions in prescriptive texts. Language in Society, 36(3):473–474.

Harold J Spaeth, Lee Epstein, Andrew D Martin, Jeffrey A Segal, Theodore J Ruger, and Sara C Benesh. 2013. Supreme Court Database, Version 2013 Release 01. Database at http://supremecourtdatabase.org.

Keet Sugathadasa, Buddhi Ayesha, Nisansa de Silva, Amal Shehan Perera, Vindula Jayawardana, Dimuthu Lakmal, and Madhavi Perera. 2019. Legal document retrieval using document vector embeddings and deep learning. In Intelligent Computing: Proceedings of the 2018 Computing Conference, Volume 2, pages 160–175. Springer.

Lichao Sun, Congying Xia, Wenpeng Yin, Tingting Liang, Philip S Yu, and Lifang He. 2020. Mixup-Transformer: Dynamic data augmentation for NLP tasks. arXiv preprint arXiv:2010.02394.

Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. 2021. Scale efficiently: Insights from pretraining and fine-tuning transformers. arXiv preprint arXiv:2109.10686.

Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. 2023. UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations.

Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. LEDGAR: A large-scale multi-label corpus for text classification of legal provisions in contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1235–1241, Marseille, France. European Language Resources Association.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Steven H. Wang, Antoine Scardigli, Leonard Tang, Wei Chen, Dimitry Levkin, Anya Chen, Spencer Ball, Thomas Woodside, Oliver Zhang, and Dan Hendrycks. 2023. MAUD: An expert-annotated legal NLP dataset for merger agreement understanding.

Yufei Wang, Can Xu, Qingfeng Sun, Huang Hu, Chongyang Tao, Xiubo Geng, and Daxin Jiang. 2022. PromDA: Prompt-based data augmentation for low-resource NLU tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4242–4255, Dublin, Ireland. Association for Computational Linguistics.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196.

Christopher Williams. 2007. Tradition and Change in Legal English: Verbal Constructions in Prescriptive Texts, volume 20. Peter Lang.

Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, and Maosong Sun. 2021. Lawformer: A pre-trained language model for Chinese legal long documents. AI Open, 2:79–84.

Yinfei Yang, Oshin Agarwal, Chris Tar, Byron C. Wallace, and Ani Nenkova. 2019. Predicting annotation difficulty to improve task routing and model performance for biomedical information extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1471–1480, Minneapolis, Minnesota. Association for Computational Linguistics.

Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyoung Park. 2021. GPT3Mix: Leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2225–2239, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.

Lucia Zheng, Neel Guha, Brandon R Anderson, Peter Henderson, and Daniel E Ho. 2021. When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 159–168.

Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020a. How does NLP benefit legal system: A summary of legal artificial intelligence. arXiv preprint arXiv:2004.12158.

Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020b. JEC-QA: A legal-domain question answering dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9701–9708.

Jing Zhou, Yanan Zheng, Jie Tang, Li Jian, and Zhilin Yang. 2022a. FlipDA: Effective and robust data augmentation for few-shot learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8646–8665, Dublin, Ireland. Association for Computational Linguistics.

Ran Zhou, Xin Li, Ruidan He, Lidong Bing, Erik Cambria, Luo Si, and Chunyan Miao. 2021. MELM: Data augmentation with masked entity language modeling for low-resource NER. arXiv preprint arXiv:2108.13655.

Ran Zhou, Xin Li, Ruidan He, Lidong Bing, Erik Cambria, Luo Si, and Chunyan Miao. 2022b. MELM: Data augmentation with masked entity language modeling for low-resource NER. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2251–2262, Dublin, Ireland. Association for Computational Linguistics.

Willem Zuidema. 2006. What are the productive units of natural language grammar? A DOP approach to the automatic identification of constructions. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 29–36.


Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar, S Ramaneswaran, S Sakshi, Utkarsh Tyagi, and Dinesh Manocha. 2023. DALE: Generative Data Augmentation for Low-Resource Legal NLP. doi:10.48550/arXiv.2310.15799.