2024 SaulLM7BAPioneeringLargeLanguag


Subject Headings: SaulLM-7B LLM, Legal LLM, Legal-MMLU.

Notes

Cited By

Quotes

Abstract

In this paper, we introduce SaulLM-7B, a large language model (LLM) tailored for the legal domain. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens. SaulLM-7B exhibits state-of-the-art proficiency in understanding and processing legal documents. Additionally, we present a novel instructional fine-tuning method that leverages legal datasets to further enhance SaulLM-7B's performance in legal tasks. SaulLM-7B is released under the MIT License.

1 Introduction

In the rapidly evolving landscape of artificial intelligence, the applications of large language models (LLMs) (Achiam et al., 2023; Scao et al., 2022; Penedo et al., 2023; Touvron et al., 2023a; Jiang et al., 2023, 2024; Touvron et al., 2023b; Bai et al., 2023) have witnessed large advancements across various domains, such as translation (Xu et al., 2023), medicine (Chen et al., 2023), and code generation (Roziere et al., 2023; Li et al., 2023). From natural language processing to machine translation, these models have exhibited exceptional capabilities in understanding and generating human-like text (Weber-Wulff et al., 2023; Islam et al., 2023; Mitchell et al., 2023). However, one field that has yet to experience the full benefit of this transformative technology is the legal domain (Martin et al., 2024; Licari and Comandè, 2022). As legal professionals grapple with an ever-expanding volume of complex documents, there is a growing need for a dedicated LLM that can help navigate and interpret legal material (Savelka et al., 2023; Katz et al., 2023; Xiao et al., 2021).

In this paper, we present a pioneering initiative to develop the first legal LLM publicly available. Legal text, characterized by its unique syntax and specialized vocabulary, presents a distinct linguistic challenge (Chalkidis et al., 2020; Niklaus et al., 2021). Our approach focuses on extensive pretraining (Gururangan et al., 2020; Yao et al., 2021) using dedicated legal corpora from English-speaking jurisdictions such as the USA, Canada, the UK, and Europe (Aletras et al., 2016; Gutiérrez-Fandiño et al., 2021). Leveraging the pretraining on a large and diverse legal dataset, both scraped by our team and drawn from previous literature (Niklaus and Giofré, 2022), our LLM, SaulLM-7B, aims not only to comprehend the complexities of legal documents but also to adapt to the evolving nature of legal discourse.

By focusing on the needs of legal practitioners and harnessing the power of pretraining on dedicated legal corpora, our work represents an important step towards fulfilling the unique demands of the legal domain. We anticipate that introducing the first LLM for law will not only empower legal professionals but also catalyze further innovation at the intersection of artificial intelligence and the legal community, making a significant contribution to legal language understanding and application (Prakken, 2013). We summarize the contributions of this work as follows:

Contribution 1
A family of legal LLMs. In this paper, we introduce the SaulLM-7B family, a collection of Legal Language Models meticulously crafted to tackle the distinctive challenges encountered within the legal domain. We unveil SaulLM-7B, a 7-billion-parameter language model specifically tailored to legal text. With its specialized training regimen, SaulLM-7B demonstrates a superior understanding of the nuances in legal language compared to generic models. Furthermore, we release SaulLM-7B-Instruct, an instruction-tuned variant, carefully engineered to outperform existing models such as Mistral or Llama on a variety of legal tasks1.
Contribution 2
An improved evaluation protocol for legal LLMs. Concurrently, we introduce LegalBench-Instruct, a supplemental iteration of LegalBench (Guha et al., 2022, 2023)2, crafted to better gauge and refine the legal proficiency of language models, which we hope will contribute to future research advancements in the legal domain. To further enrich the models' capabilities in legal contexts, we also include the legal tasks of the popular MMLU benchmark (Hendrycks et al., 2020) in our evaluation protocol, particularly focusing on international law, professional law3, and jurisprudence.
Contribution 3
Model, Evaluation Code & Licensing. To foster widespread adoption and promote innovation, we release SaulLM-7B and SaulLM-7B-Instruct, as well as our evaluation code under the MIT License. This open licensing approach encourages collaborative development and adoption into a wide array of commercial and research endeavors within the legal domain and beyond.
1 Model is available at https://huggingface.co/Equall.
2 Dataset is processed and available at https://huggingface.co/Equall
3 We use the term “professional law” here as defined in (Hendrycks et al., 2020)

2 SaulLM-7B: Extending the legal capabilities of Language Models

A wide range of open-source large language models is available for the backbone, spanning from 70-million-parameter models like Pythia (Biderman et al., 2023) to 180-billion-parameter models like Falcon (Almazrouei et al., 2023). In this work, we choose the Mistral 7B model, a 7-billion-parameter open-source model that achieves high performance across benchmarks and tasks (Jiang et al., 2023). Our methodology, shown in Figure 1, involves a two-step process that we describe below.

2.1 Enhancing Mistral’s Legal Capabilities

While generic models (Touvron et al., 2023a; Taylor et al., 2022; Zhang et al., 2022; Gu and Dao, 2023; Almazrouei et al., 2023; Zhang et al., 2024; Faysse et al., 2024) gain some exposure to legal data during their training, it typically only represents a minor fraction of the overall data. A straightforward method to enhance performance for legal tasks is to perform additional training focusing on legal data. This approach, particularly focused on decoder models, has been successfully used in various fields such as medicine (Chen et al., 2023; Ji et al., 2023), translation (Xu et al., 2023; Wu et al., 2024), and coding (Roziere et al., 2023). The key advantage of this approach is its scalability and independence from the specific characteristics of the training data. Other research on domain adaptation has attempted to specialize language models via pretext tasks. However, these efforts often rely on smaller-scale approaches (Niklaus and Giofré, 2023), are computationally expensive (Vu et al., 2020; Lu et al., 2023), or lack scalability (Cheng et al., 2023; Cui et al., 2023; Nishida et al., 2019). For these reasons, as well as the availability of large-scale legal corpora from the web, we chose to focus on continued pretraining. We meticulously curate a high-quality dataset sourced from diverse legal content repositories. After rigorous filtering (Penedo et al., 2023) and deduplication (Mou et al., 2023; Kocetkov et al., 2023), we end up with a corpus of 30 billion tokens, which serves as a robust foundation for continued pretraining.
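At its core, this continued pretraining step is standard causal-language-model training of the Mistral 7B backbone on the filtered legal corpus. A minimal sketch with the Hugging Face Trainer is shown below; the corpus file name, sequence length, and batch settings are illustrative assumptions, not the authors' configuration (their distributed setup is described in Section 5.2).

```python
# Minimal continued-pretraining sketch (illustrative only).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"          # backbone used by SaulLM-7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # Mistral ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# "legal_corpus.jsonl" is a placeholder for the filtered 30B-token legal corpus.
raw = load_dataset("json", data_files="legal_corpus.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="saullm-7b-base",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           bf16=True,
                           num_train_epochs=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```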

2.2 Improving Legal Instruction Following

To support user requests and conversational interaction, LLMs typically undergo instruction tuning, a critical process involving training on supervised conversational pairs. This step is essential for crafting a versatile model, adept at addressing user queries (Wang et al., 2023a; Wei et al., 2021; Chung et al., 2022; Faysse et al., 2023; Ding et al., 2023; Wang et al., 2023b).

For general-purpose language models, diversity and quality of instructions are crucial (Cao et al., 2023; Zhou et al., 2023). However, in specialized domains it is crucial to incorporate task-specific and specialized prompts to enhance performance. Our instruction fine-tuning stage involves two key components: generic (i.e., non-legal) and legal instructions. The former helps enhance the model's ability to understand and follow commands, and includes data from diverse domains such as coding, mathematics, and general conversations. For the latter, we employ an extensive collection of datasets tailored to the nuances of legal domains, covering legal question answering and summarization, among others. Through this meticulous fine-tuning on instructional data, our model, SaulLM-7B-Instruct, is able to grasp legal intricacies and excels in a wide range of associated tasks.

Remark. It’s worth noting that many common LLMs (Tunstall et al., 2023) include an additional step to align the model with human preferences (Rafailov et al., 2023; Munos et al., 2023; von Werra et al., 2020). In our case, early experiments did not show any meaningful improvement in performance, so we opted not to pursue this avenue for the present paper.
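For completeness, the sketch below shows what such a preference-alignment step could look like using DPO with the TRL library cited above (Rafailov et al., 2023; von Werra et al., 2020). It is not part of the SaulLM-7B recipe; the checkpoint path, dataset file, and hyperparameters are placeholders, and the exact DPOTrainer signature varies across TRL versions.

```python
# Illustrative DPO alignment step (not used in the final SaulLM-7B recipe).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

ckpt = "path/to/saullm-7b-instruct-checkpoint"     # placeholder path
model = AutoModelForCausalLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# A preference dataset with "prompt", "chosen", and "rejected" columns (placeholder file).
prefs = load_dataset("json", data_files="legal_preferences.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="saullm-7b-dpo", beta=0.1, per_device_train_batch_size=1),
    train_dataset=prefs,
    processing_class=tokenizer,   # older TRL versions take tokenizer= instead
)
trainer.train()
```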

Figure 1: Procedure for constructing SaulLM-7B. We rely on legal datasets augmented with replay data and instruction datasets. For fine-tuning, we further enrich our instruction fine-tuning dataset with legal instructions.

3 Data

In this section we describe our data collection and cleaning schemes.

3.1 Legal Pretraining Corpora

Unlike fields such as science and medicine, the legal landscape varies significantly across countries and jurisdictions, reflecting differences not only in local laws but also in legal traditions, like common law versus civil law (Henderson et al., 2022). Thus, we gathered legal texts from various jurisdictions, with a primary focus on the English language due to its widespread use in legal contexts worldwide. Our collection includes data from the U.S. (Tuggener et al., 2020), Europe (Chalkidis et al., 2019), and Australia (Butler, 2023), covering a diverse range of legal systems. Through this thorough curation process and aggressive cleaning (see Section 3.1.2), we end up with a corpus of 30 billion tokens, capturing the intricacies of legal language across regions.


3.1.1 Dataset Composition

Legal Sources We combine both previously available datasets, such as the FreeLaw subset from The Pile (Gao et al., 2020) and MultiLegal Pile (Niklaus et al., 2023), as well as data scraped from publicly available sources on the Web. We list the different sources of data in Table 1.

Name                                           Tokens
FreeLaw [4]                                    15B
EDGAR [5]                                      5B
English MultiLegal Pile [6]                    50B
English EuroParl (Koehn, 2005)                 6B
GovInfo [7] Statutes, Opinions & Codes         11B
Law Stack Exchange [8]                         19M
Commercial Open Australian Legal Corpus [9]    0.5B
EU Legislation [10]                            315M
UK Legislation [11]                            190M
Court Transcripts [12]                         350M
USPTO [13]                                     4.7B
Total                                          94B

Table 1: Sources of Legal Pretraining Data. These sources contain noise and heavily duplicated documents, which we filtered and deduplicated, resulting in a 30-billion-token dataset.
4 We used the subset from The Pile (Gao et al., 2020).
5 https://www.sec.gov/edgar
6 We limited ourselves to the commercially-licensed subset: https://huggingface.co/datasets/joelniklaus/Multi_Legal_Pile_Commercial
7 https://www.govinfo.gov/
8 https://huggingface.co/datasets/ymoslem/Law-StackExchange
9 https://github.com/umarbutler/open-australian-legal-corpus-creator
10 Scraped from https://eur-lex.europa.eu/homepage.html
11 https://www.legislation.gov.uk/
12 Obtained from CourtListener: https://www.courtlistener.com/. We use Whisper (Radford et al., 2022) to transcribe the audio files.
13 https://bulkdata.uspto.gov/

There is quite a lot of overlap between the different sources, and we run very aggressive cleaning and deduplication steps, described in Section 3.1.2.

Replay Sources To reduce the risk of catastrophic forgetting (McCloskey and Cohen, 1989) during continued pretraining, we incorporate data from the prior training distribution, following prior literature (Chen et al., 2023; Sun et al., 2020). However, since the training data for Mistral is undisclosed, we introduce commonly available "general" data from Wikipedia, StackExchange, and GitHub, comprising roughly 2% of the final training mix. These datasets are sampled from SlimPajama (Shen et al., 2023; Computer, 2023; Soboleva et al., 2023).

Instruction Sources Additionally, we found it beneficial to include conversational data during pretraining. This is inspired by recent advances in neural machine translation, which highlight that the robust translation capabilities of LLMs are due to the existence of accidental parallel data in the training corpus (Anil et al., 2023; Briakou et al., 2023). Specifically, this means that we include the Super Natural Instruction (Wang et al., 2022) and FLAN collection (Longpre et al., 2023) during pretraining.

3.1.2 Data Cleaning

A significant fraction of the collected data is either in PDF files or is text extracted from PDFs14. This means that the text has some artifacts, including i) page numbers in the middle of sentences; ii) line numbers; iii) non-normalized unicode characters; iv) broken lines of text; v) repeated characters: new lines, dashes, etc.; vi) other artifacts. We addressed these issues using a combination of rules and heuristics to filter the data.

Text Normalization We normalize all unicode with the NFKC method, available through the unicodedata Python package.

Rule filters Following Elazar et al. (2023), we found the most common 10-grams in our dataset and used regular expressions to remove the undesired ones, which were mostly repeated characters. Concretely, 8 of the top 10 10-grams in the original data were repeated characters, e.g. "- - - - - - - - - -", ". . . . . . . . . .", or "* * * * * * * * * *", and weird characters, i.e. encoding issues. Additionally, we removed repeated whitespace (spaces, new lines, and tabs), as well as any HTML tag that made it through our pipeline.

Perplexity filtering We trained a KenLM model (Heafield, 2011) on a small subset of carefully inspected legal data, and used it to filter any high-perplexity paragraph. This removed non-English text as well as most of the "weird" unicode sequences present in the data. We show some of the most common 10-grams in the filtered data in Table 2.

14 We used Poppler for text extraction from PDF files.

Common 10-grams

have been obvious to one of ordinary skill in the
before the effective filing date of the claimed invention to
rejected under 35 U.S.C. 103 as being unpatentable over

Table 2: Most common 10-grams in the pretraining dataset.
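A minimal sketch of the cleaning steps above (NFKC normalization, rule-based filters, whitespace collapsing, and KenLM perplexity filtering) follows. The regular expressions, the KenLM model path, and the perplexity threshold are illustrative assumptions; the paper does not publish the exact rules or cut-offs.

```python
# Sketch of the Section 3.1.2 cleaning pipeline; thresholds and patterns are assumed.
import re
import unicodedata

import kenlm  # Python bindings for KenLM (Heafield, 2011)

REPEATED_PUNCT = re.compile(r"([*\-._=])(\s*\1){5,}")   # e.g. "- - - - - -", "* * * * * *"
HTML_TAG = re.compile(r"<[^>]{1,200}>")
MULTI_SPACE = re.compile(r"[ \t]{2,}")
MULTI_NEWLINE = re.compile(r"\n{3,}")

lm = kenlm.Model("legal_kenlm_5gram.binary")             # trained on inspected legal text (placeholder path)
PPL_THRESHOLD = 1500.0                                   # assumed cut-off

def clean_paragraph(text: str) -> str | None:
    """Return a cleaned paragraph, or None if it should be filtered out."""
    text = unicodedata.normalize("NFKC", text)           # unicode normalization
    text = HTML_TAG.sub(" ", text)                        # strip leftover HTML tags
    text = REPEATED_PUNCT.sub(" ", text)                  # drop repeated-character runs
    text = MULTI_SPACE.sub(" ", text)                     # collapse repeated whitespace
    text = MULTI_NEWLINE.sub("\n\n", text)
    text = text.strip()
    if not text:
        return None
    # Perplexity filter: drop high-perplexity (non-English / garbled) paragraphs.
    if lm.perplexity(text) > PPL_THRESHOLD:
        return None
    return text
```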

3.1.3 Data Deduplication

Inspired by Kocetkov et al. (2023); Lee et al. (2021), we removed duplicates and near-duplicates from the training data using Mou et al. (2023), with default parameters, after which we were left with roughly 30B tokens of high-quality text.

3.2 Instruction Finetuning Mixes

Instruction fine-tuning is crucial for getting the best performance out of the pre-trained decoder models across different tasks. We use a mix of general and legal instructions to train the model to understand and follow instructions well, with a focus on legal expertise.

General Instructions When it comes to general instructions, we gather them from four primary sources:

  1. SlimOrca: This subset of the FLAN collection comprises generic instructions, offering a focused resource for various tasks (Mukherjee et al., 2023; Lian et al., 2023).
  2. Meta Math Question Answering Instructions: Designed for mathematical inquiry, this dataset15 presents a range of mathematical questions, facilitating research in math-based natural language processing (Yu et al., 2023).
  3. General Conversations from UltraChat: Capturing diverse conversational contexts, this GPT-derived dataset contributes to enhancing natural language understanding and generation systems (Ding et al., 2023).
  4. Code Instructions from Glaive Code Assistant v216: Training on code has been shown to increase the reasoning ability of models (Ma et al., 2023).

We meticulously filter, deduplicate, and curate all this data, resulting in a refined dataset comprising 600K instructions.
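A sketch of how such a generic instruction mix could be assembled with the datasets library is shown below. Only the MetaMathQA and Glaive Code Assistant v2 repository ids are given in the paper's footnotes; the SlimOrca and UltraChat ids, the field names, and the schema normalization are assumptions for illustration, and the filtering/deduplication steps are omitted.

```python
from datasets import load_dataset, concatenate_datasets

SOURCES = {
    "slimorca": "Open-Orca/SlimOrca",                    # assumed repo id
    "metamath": "meta-math/MetaMathQA",                  # from footnote 15
    "ultrachat": "stingning/ultrachat",                  # assumed repo id
    "glaive_code": "glaiveai/glaive-code-assistant-v2",  # from footnote 16
}

def to_messages(example, source):
    # Normalize each source's schema (field names are simplified assumptions)
    # into a common list of {"role", "content"} turns.
    if source == "metamath":
        turns = [("user", example["query"]), ("assistant", example["response"])]
    elif source == "glaive_code":
        turns = [("user", example["question"]), ("assistant", example["answer"])]
    elif source == "ultrachat":
        roles = ["user", "assistant"]
        turns = [(roles[i % 2], t) for i, t in enumerate(example["data"])]
    else:  # slimorca: ShareGPT-style "conversations"
        role_map = {"system": "system", "human": "user", "gpt": "assistant"}
        turns = [(role_map[m["from"]], m["value"]) for m in example["conversations"]]
    return {"source": source,
            "messages": [{"role": r, "content": c} for r, c in turns]}

parts = []
for name, repo in SOURCES.items():
    ds = load_dataset(repo, split="train")
    parts.append(ds.map(lambda ex, n=name: to_messages(ex, n),
                        remove_columns=ds.column_names))

# The paper reports ~600K instructions after filtering and deduplication,
# which are not reproduced here.
general_instructions = concatenate_datasets(parts).shuffle(seed=0)
```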

Legal Instruction Construction
We synthetically generate comprehensive conversations addressing fundamental legal competencies across multiple legal document types (Ding et al., 2023). We leverage Mistral-7B-instruct to transform legal texts augmented with metadata into coherent conversations. The methodology involves initiating the conversation with 3 predefined turns: (1) the user articulates a request related to the legal document, (2) the assistant responds by rephrasing the metadata (e.g., document type, date, name of a judge), and (3) the user prompts the assistant to elaborate on its reasoning. Subsequently, we extend the conversation through a series of turns, where a user model progressively poses more specific questions to grasp the assistant's reasoning. Simultaneously, an assistant model provides in-depth insights. An illustrative example is presented in Figure 2. Notably, we ensure the exclusion of the test set from existing benchmarks.
15 Accessible at meta-math/MetaMathQA
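The sketch below illustrates this construction: the first three turns are hard-coded from the example and its metadata, and the conversation is then extended by alternating a user model and an assistant model, both instantiated from Mistral-7B-Instruct. The prompts, metadata fields, and number of extension turns are illustrative choices, not the authors' exact templates.

```python
# Sketch of the synthetic legal-conversation construction (Section 3.2 / Figure 2).
from transformers import pipeline

chat = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def build_conversation(document: str, metadata: dict, n_extra_turns: int = 3):
    # Turns 1-3 are hard-coded from the raw example and its metadata
    # ("doc_type" and "date" are placeholder metadata keys).
    conv = [
        {"role": "user", "content": f"Here is a legal text:\n{document}\n"
                                    f"What kind of document is this?"},
        {"role": "assistant", "content": f"This appears to be a {metadata['doc_type']} "
                                         f"dated {metadata['date']}."},
        {"role": "user", "content": "Can you explain the reasoning behind your answer?"},
    ]
    # Alternate an assistant model and a user model to extend the conversation.
    for _ in range(n_extra_turns):
        assistant_reply = chat(conv, max_new_tokens=256)[0]["generated_text"][-1]
        conv.append(assistant_reply)
        user_prompt = conv + [{"role": "user",
                               "content": "Ask one more specific follow-up question "
                                          "about this document, as the user."}]
        follow_up = chat(user_prompt, max_new_tokens=128)[0]["generated_text"][-1]
        conv.append({"role": "user", "content": follow_up["content"]})
    return conv
```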

4 Evaluation of Legal Knowledge

To evaluate the model's legal abilities, we use three benchmarks: (i) we compare the perplexity of the backbones on 5 types of legal documents, (ii) we enhance LegalBench with LegalBench-Instruct for deeper evaluation, and (iii) we rely on the legal section of MMLU for additional insights.

Perplexity Measurement To evaluate the adaptability of the backbones to legal documents, we assess perplexity using benchmark datasets spanning four distinct legal domains: contracts, judicial decisions, opinion text, and legislation. We ensure that the datasets are up-to-date and sourced after the training data collection cut-off dates of the LLMs. Specifically, contract data is sourced from EDGAR (first quarter of 2024), legal decisions from ICSID court decisions published after October 2023, legislation focuses on US bills submitted before the House or Senate after October 2023, and party submissions include Texas briefs submitted after October 2023.

During our investigations, we found a significant limitation in the original prompts of LegalBench. The complex nature of these prompts, combined with the challenges encountered by open-source LLMs in adhering to instructions - particularly in handling formatting - leads to a substantial drop in performance (as measured by accuracy). The generated sentences are often verbose and difficult to parse, rendering LegalBench in its current form too stringent and failing to accurately gauge improvement on the task.

16 Available at https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v2

Figure 2: Turning a dataset with metadata into a conversation. Taking the example of Reddit post classification, we turn a labeled example {"My employer fired me because . . . Is it legal?", "employment"} into a conversation: we hard-code the first three turns by simply reformulating the query and answer as a natural exchange. We then complete the conversation using a user model (blue, dashed), whose task is to keep generating relevant questions from the ongoing conversation, and an assistant model that provides answers. Both assistant and user models are Mistral-7B-instruct.

For example, in some of the tasks, performance is evaluated by the first word the model predicts, and this word is expected to be a Yes/No. This means that if the response is a bit verbose, it will be counted as incorrect, even if a human would classify it as a correct answer. To remedy this shortcoming, we refine the prompts by 1) removing distracting few-shot examples and 2) concluding with a specific instruction for the model to generate tags (see Table 3).

Massive Multitask Language Understanding (MMLU) The MMLU benchmark (Hendrycks et al., 2020) has been widely employed to gauge the advances in LLM performance. In our study, we center our analysis on the legal domain, with a specific focus on international law, professional law, and jurisprudence. Those tasks contain 120, 1,500, and 110 examples, respectively.

Original Prompt

The Telemarketing Sales Rule is provided by 16 C.F.R. § 310.3(a)(1) and 16 C.F.R. § 310.3(a)(2). Question: Acme Toys is a telemarketer subject to the Telemarketing Sales Rule. Acme Toys told a customer that its frisbees cost $10 each, when in fact the frisbees cost $12 each. The customer agreed to the sale and was charged $12. Is this a violation of the Telemarketing Sales Rule? Answer: Yes

Question: Acme Toys is a telemarketer subject to the Telemarketing Sales Rule. Acme Toys told a customer that its frisbees cost $10 each, when in fact the frisbees did cost $10, but Acme Toys did not disclose that shipping would cost an additional $5. The customer agreed to the sale. Is this a violation of the Telemarketing Sales Rule? Answer: Yes

Question: Acme Industrial Products is a telemarketer subject to the Telemarketing Sales Rule. Acme Industrial Products told a customer that its brooms cost $12 each, and the brooms did in fact cost $12. The customer agreed to the sale. Is this a violation of the Telemarketing Sales Rule? Answer: No

Question: Acme Industrial Products is a telemarketer subject to the Telemarketing Sales Rule. Acme Industrial Products told a customer that it would sell them 4 brooms for $10 and that shipping would be $5. Then, the customer agreed to the sale. Is this a violation of the Telemarketing Sales Rule? Answer: No

Question: {text} Answer:

Curated Prompt (Ours)

The Telemarketing Sales Rule is provided by 16 C.F.R. § 310.3(a)(1) and 16 C.F.R. § 310.3(a)(2). Answer the following question: {text} Answer by only outputting "Yes" or "No"

Table 3: Example from LegalBench-Instruct. We manually curated and corrected typos, removing a few short examples from LegalBench as they were found to distract LLMs of size 7B.
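With prompts that end in an explicit output instruction, scoring reduces to finding the first Yes/No tag in a possibly verbose generation rather than requiring an exact first-word match. A sketch follows; the parsing rules of the released evaluation code may differ.

```python
# Sketch of answer extraction and prompt curation for LegalBench-Instruct-style tasks.
import re

YES_NO = re.compile(r"\b(yes|no)\b", re.IGNORECASE)

def extract_label(generation: str) -> str | None:
    """Return 'Yes' / 'No' from a possibly verbose model output, or None."""
    match = YES_NO.search(generation)
    return match.group(1).capitalize() if match else None

def curate_prompt(rule_text: str, question: str) -> str:
    # Mirrors the curated prompt in Table 3: no few-shot distractors,
    # plus an explicit output instruction.
    return (f"{rule_text}\n"
            f"Answer the following question: {question}\n"
            f'Answer by only outputting "Yes" or "No"')
```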

4.1 Metrics

We use the same metric as the original LegalBench (Guha et al., 2023) paper: balanced accuracy. Balanced accuracy allows for better handling of imbalanced classification tasks, such as the ones presented in both benchmarks. We also use balanced accuracy for the legal tasks of MMLU. Unless otherwise noted, any score reported throughout this section refers to the balanced accuracy.
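Balanced accuracy is the unweighted mean of per-class recall, so a classifier cannot score well by always predicting the majority class. A small self-contained sketch (equivalent to sklearn.metrics.balanced_accuracy_score):

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    # Unweighted mean of per-class recall.
    per_class_total = defaultdict(int)
    per_class_correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        per_class_total[t] += 1
        per_class_correct[t] += int(t == p)
    recalls = [per_class_correct[c] / per_class_total[c] for c in per_class_total]
    return sum(recalls) / len(recalls)

# Example: a majority-class predictor gets 0.5, not 0.9, on a 9:1 imbalanced task.
assert balanced_accuracy(["No"] * 9 + ["Yes"], ["No"] * 10) == 0.5
```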

5 Experimental Setting

5.1 Baselines

We compare the SaulLM-7B family to other state-of-the-art 7B and 13B open-source models. Concretely, we include the following instruction- and DPO-finetuned variants of Mistral-7B (Jiang et al., 2023): Mistral-7B-Instruct-v0.1, Mistral-7B-Instruct-v0.2, as well as zephyr-7b-beta17. We also evaluate the Llama2 (Touvron et al., 2023a) family, more specifically Llama2-7b-Chat and Llama2-13b-Chat.
Figure 3: Performance of base models on LegalBench-Instruct. Interestingly, although not instruction fine-tuned, SaulLM-7B is still able to achieve impressive improvements on the benchmark compared to other base models, including SaulLM-7B's initial checkpoint (Mistral-7B).

5.2 Implementation Details

Codebase Our codebase relies on open-source frameworks (Shoeybi et al., 2019; Wolf et al., 2019; Lhoest et al., 2021), utilizing DeepSpeed (level 3) with Flash attention (Dao et al., 2022; Dao, 2023). It is built on PyTorch (Paszke et al., 2019), and our models are available on the Huggingface hub.

Compute Continued pretraining utilizes 256 AMD MI250 GPUs. For instruction fine-tuning, the workload is distributed across 16 MI250 GPUs. Evaluation procedures are conducted on a single MI250.
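A hedged sketch of what this setup looks like from the Hugging Face side is shown below: loading the backbone with FlashAttention-2 and pointing TrainingArguments at a DeepSpeed ZeRO-3 configuration file. The config file name and batch settings are placeholders, not the authors' actual values.

```python
# Sketch of the Section 5.2 implementation details (illustrative values only).
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # Flash attention (Dao et al., 2022; Dao, 2023)
)

args = TrainingArguments(
    output_dir="saullm-7b-continued-pretraining",
    deepspeed="ds_zero3_config.json",           # DeepSpeed ZeRO stage-3 config file (placeholder)
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
```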

6 Results

In this section, we discuss our main experimental findings and results.

6.1 LegalBench-Instruct

Figures 3 and 4 summarize our results on LegalBench-Instruct. There are 3 main takeaways, which we discuss below.

17 https://huggingface.co/HuggingFaceH4/zephyr-7b-beta




Figure 4: Influence of the base model. Starting the instruction finetuning from our base model SaulLM-7B brings noticeable improvements compared to Mistral-7B. Indeed, even with a generic IFT mix (without legal instructions), SaulLM-7B (Gen.) outperforms its Mistral-Instruct counterpart significantly. Adding legal instructions to the IFT mix further boosts the results.

I. Legal continued pretraining brings significant improvements We start by analyzing the impact of our proposed continued pretraining. As seen in Figure 3, SaulLM-7B is a strong standalone model. We speculate that its strong performance is largely due to the integration of instructions in the pretraining data, as mentioned in Section 3.1.1. Nevertheless, we still note that even without a dedicated instruction fine-tuning stage, SaulLM-7B performs on par with Llama2-7B-chat (0.38 vs. 0.39). More importantly, SaulLM-7B serves as a strong base model for building IFT models with strong legal capabilities. When combined with generic instruction finetuning, as seen in Figure 4, it achieves a strong average of 0.59, i.e. 4 absolute points of improvement with respect to the best open-source instruct model Mistral-7B-Instruct-v0.1.

II. Legal instruction finetuning further boosts the results As seen in Figure 5, finetuning SaulLM-7B on both general and legal instructions (SaulLM-7B-Instruct) establishes a new state-of-the-art on the LegalBench-Instruct benchmark, with an average score of 0.61, i.e. an 11% relative improvement compared to the best open-source instruct model. Finally, DPO-aligned models tend to underperform their instruction-tuned counterparts, which could be explained by the fact that generic alignment is not suited for out-of-distribution tasks, such as the ones present in LegalBench-Instruct. Although beyond the scope of the present work, an interesting research direction would be to explore how legal-specific DPO can help.


Figure 5: Comparison of instruct models on LegalBench-Instruct. SaulLM-7B-Instruct establishes the state-of-the-art, outperforming the best Mistral-Instruct model by a significant 6 absolute points.

Figure 6: Instruct models on Legal-MMLU (international law, professional law, and jurisprudence). Echoing findings on LegalBench-Instruct, SaulLM-7B-Instruct displays superior performance on all three tasks of Legal-MMLU, with an average absolute improvement of 5 points with respect to Mistral-7B-Instruct-v0.1.

III. There is still room for significant improvement. Next, we follow the original LegalBench taxonomy (Guha et al., 2023) to gain a more granular understanding of SaulLM-7B-Instruct’s performance, by partitioning the tasks into 5 core legal abilities: ISSUE SPOTTING, RULE-RECALL, INTERPRETATION, RHETORIC UNDERSTANDING, and RULE-CONCLUSION. Results show an interesting trend (Figure 7): SaulLM-7B-Instruct shows clear superior performance over the best non-legal competitor Mistral-7B-Instruct-v0.1 on the four areas that require the most legal expertise, i.e. ISSUE, RULE, INTERPRETATION and UNDERSTANDING. On the other hand, it falls short of Mistral-7B-Instruct-v0.1 on the CONCLUSION tasks, which interestingly require much more pure deductive reasoning than actual legal knowledge. We speculate that augmenting our pretraining and fine-tuning corpora with more deductive reasoning content, including but not limited to mathematics datasets, could reduce the gap and fully unlock the potential of SaulLM-7B-Instruct.


Figure 7: Per-task performance breakdown. SaulLM-7B-Instruct largely outperforms generic instruct models on tasks that most require legal-specific knowledge, but is outperformed by Mistral-Instruct on the conclusion tasks, which necessitate more deductive reasoning.

6.2 Results on Legal-MMLU

To confirm our observations on LegalBench-Instruct, we analyze the results on Legal-MMLU shown in Figure 6. Again, SaulLM-7B-Instruct exhibits consistent superiority over non-legal instruction-tuned models, with a gap of 3 to 4 absolute points to the best 7B open-source competitor across the three tasks, providing additional evidence that SaulLM-7B-Instruct is a strong foundation on which to build models tailored to legal workflows.

Figure 8: Perplexity on legal documents for pretrained backbones. SaulLM-7B-Instruct outperforms other pretrained backbones on most types of legal documents, but is outperformed by Llama2-7B on legislation. SaulLM-7B-Instruct exhibits a median perplexity of 8.69, a reduction of 5.5 percent compared to Mistral-7B (9.20) and of 10.8 percent compared to Llama2-7B (9.74).
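The sketch below shows one standard way to compute document-level perplexities like those reported in Figure 8 and Section 6.3 with a Hugging Face causal LM: slide a fixed-length window over each document and exponentiate the token-averaged loss. The window size and the absence of overlap between windows are assumptions; the paper does not specify its exact protocol.

```python
# Sketch of document-level perplexity measurement (illustrative protocol).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def document_perplexity(model, tokenizer, text, window=2048):
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(ids), window):
        chunk = ids[start:start + window].unsqueeze(0)
        if chunk.shape[1] < 2:
            break
        with torch.no_grad():
            out = model(chunk, labels=chunk)   # mean cross-entropy over predicted tokens
        n = chunk.shape[1] - 1                 # tokens actually predicted in this window
        total_loss += out.loss.item() * n
        total_tokens += n
    return math.exp(total_loss / total_tokens)

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
lm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
print(document_perplexity(lm, tok, "The Court finds that the contract is void."))
```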

6.3 Perplexity Analysis

To assess the adaptation of the SaulLM-7B backbone to the legal domain, we present perplexity scores across four document types: contracts, legal decisions, legislation, and party submissions. Refer to Figure 8 for the results. Our model, SaulLM-7B, consistently outperforms Mistral-7B across all categories, exhibiting lower average perplexity scores with reduced variance. Interestingly, Llama2-7B demonstrates lower perplexity specifically on legislation documents, suggesting a potentially higher proportion of legislative text in its pretraining corpora compared to Mistral-7B. Overall, compared to Mistral-7B, our model shows a median perplexity reduction of 3 percent across legal corpora, and 11 percent when compared to Llama2-7B.

7 Conclusion & Future Perspectives

In this paper, we introduce SaulLM-7B, an open-source decoder model delivering state-of-the-art performance, compared to 7B models, within the legal domain. Our approach entails continued pretraining on legal data alongside instruction fine-tuning on synthetic datasets. Additionally, we contribute by providing a cleaned version of LegalBench and introducing a new set of documents for perplexity measurement. We hope that our model, which is released under the MIT license, will contribute to the open-source ecosystem and the community.


References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro, and Vasileios Lampos. 2016. Predicting judicial decisions of the european court of human rights: A natural language processing perspective. PeerJ computer science, 2:e93.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Al- shamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. The falcon series of open language models.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin John- son, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Sheng- guang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingx- uan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hal- lahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language mod- els across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.

Eleftheria Briakou, Colin Cherry, and George Foster. 2023. Searching for needles in a haystack: On the role of incidental bilingualism in palm's translation capability. arXiv preprint arXiv:2305.10266.

Umar Butler. 2023. Open australian legal corpus.

Yihan Cao, Yanbin Kang, and Lichao Sun. 2023. Instruction mining: High-quality instruction data selection for large language models. arXiv preprint arXiv:2307.06290.

Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Ale- tras. 2019. Neural legal judgment prediction in en- glish. arXiv preprint arXiv:1906.02059.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. Legal-bert: The muppets straight out of law school. arXiv preprint arXiv:2010.02559.

Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. 2023. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.

Daixuan Cheng, Shaohan Huang, and Furu Wei. 2023. Adapting large language models via reading compre- hension. arXiv preprint arXiv:2309.09530.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Together Computer. 2023. Redpajama: an open dataset for training large language models.

Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092.

Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.

Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhi- lasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, and Jesse Dodge. 2023. What’s in my big data?

Manuel Faysse, Patrick Fernandes, Nuno Guerreiro, António Loison, Duarte Alves, Caio Corro, Nico- las Boizard, João Alves, Ricardo Rei, Pedro Mar- tins, et al. 2024. Croissantllm: A truly bilin- gual french-english language model. arXiv preprint arXiv:2402.00786.

Manuel Faysse, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2023. Revisiting instruction fine- tuned model evaluation to guide industrial applica- tions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Leo Gao, Stella Biderman, Sid Black, Laurence Gold- ing, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The pile: An 800gb dataset of diverse text for language modeling.

Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

Neel Guha, Daniel E Ho, Julian Nyarko, and Christopher Ré. 2022. Legalbench: Prototyping a collaborative benchmark for legal reasoning. arXiv preprint arXiv:2209.06120.

Neel Guha, Julian Nyarko, Daniel E Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N Rockmore, et al. 2023. Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models. arXiv preprint arXiv:2308.11462.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.

Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Aitor Gonzalez-Agirre, and Marta Villegas. 2021. Spanish legalese language model and corpora. arXiv preprint arXiv:2110.12201.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland. Association for Computational Linguistics.

Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel Ho. 2022. Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset. Advances in Neural Information Processing Systems, 35:29217–29234.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

Niful Islam, Debopom Sutradhar, Humaira Noor, Jarin Tasnim Raya, Monowara Tabassum Maisha, and Dewan Md Farid. 2023. Distinguishing human generated text from chatgpt generated text using ma- chine learning. arXiv preprint arXiv:2306.01761. Shaoxiong Ji, Tianlin Zhang, Kailai Yang, Sophia Ana- niadou, Erik Cambria, and Jörg Tiedemann. 2023. Domain-specific continued pretraining of language models for capturing long context in mental health. arXiv preprint arXiv:2304.10447. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gi- anna Lengyel, Guillaume Bour, Guillaume Lam- ple, Lélio Renard Lavaud, Lucile Saulnier, Marie- Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mix- tral of experts. Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. 2023. Gpt-4 passes the bar exam. Available at SSRN 4389233. Denis Kocetkov, Raymond Li, Loubna Ben allal, Jia LI, Chenghao Mou, Yacine Jernite, Margaret Mitchell, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro Von Werra, and Harm de Vries. 2023. The stack: 3 TB of permissively li- censed source code. Transactions on Machine Learn- ing Research. Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand. Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499. Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. 2021. Datasets: A commu- nity library for natural language processing. arXiv preprint arXiv:2109.02846. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo,

Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Dan- ish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. Starcoder: may the source be with you! Wing Lian, Guan Wang, Bleys Goodson, Eugene Pent- land, Austin Cook, Chanvichet Vong, and "Teknium". 2023. Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification. Daniele Licari and Giovanni Comandè. 2022. Italian- legal-bert: A pre-trained transformer language model for italian law. In CEUR Workshop Proceedings (Ed.), The Knowledge Management for Law Work- shop (KM4LAW). Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. 2023. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688. Keming Lu, Peter Potash, Xihui Lin, Yuwen Sun, Zihan Qian, Zheng Yuan, Tristan Naumann, Tianxi Cai, and Junwei Lu. 2023. Prompt discriminative language models for domain adaptation. In Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 247–258. Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, and Shanshan Li. 2023. At which training stage does code data help llms reasoning? Lauren Martin, Nick Whitehouse, Stephanie Yiu, Lizzie Catterson, and Rivindu Perera. 2024. Better call gpt, comparing large language models against lawyers. arXiv preprint arXiv:2401.16212. Michael McCloskey and Neal J. Cohen. 1989. Catas- trophic interference in connectionist networks: The sequential learning problem. volume 24 of Psychol- ogy of Learning and Motivation, pages 109–165. Aca- demic Press. Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. 2023. Detectgpt: Zero-shot machine-generated text detec- tion using probability curvature. arXiv preprint arXiv:2301.11305.

Chenghao Mou, Chris Ha, Kenneth Enevoldsen, and Peiyuan Liu. 2023. Chenghaomou/text-dedup: Ref- erence snapshot. Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawa- har, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of gpt-4. Rémi Munos, Michal Valko, Daniele Calandriello, Mo- hammad Gheshlaghi Azar, Mark Rowland, Zhao- han Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, et al. 2023. Nash learning from human feedback. arXiv preprint arXiv:2312.00886. Joel Niklaus, Ilias Chalkidis, and Matthias Stürmer. 2021. Swiss-judgment-prediction: A multilingual legal judgment prediction benchmark. arXiv preprint arXiv:2110.00806. Joel Niklaus and Daniele Giofré. 2022. Budget- longformer: Can we cheaply pretrain a sota le- gal language model from scratch? arXiv preprint arXiv:2211.17135. Joel Niklaus and Daniele Giofré. 2023. Can we pretrain a sota legal language model on a budget from scratch? Association for Computational Linguistics. Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Daniel E. Ho. 2023. Multilegalpile: A 689gb multilingual legal corpus. Kosuke Nishida, Kyosuke Nishida, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Unsupervised do- main adaptation of language models for reading com- prehension. arXiv preprint arXiv:1911.10768. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32. Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116. Henry Prakken. 2013. Logical tools for modelling legal argument: a study of defeasible reasoning in law, volume 32. Springer Science & Business Media. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak su- pervision. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language

model is secretly a reward model. arXiv preprint arXiv:2305.18290. Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950. Jaromir Savelka, Kevin D Ashley, Morgan A Gray, Hannes Westermann, and Huihui Xu. 2023. Explain- ing legal concepts with augmented large language models (gpt-4). arXiv preprint arXiv:2306.09525. Teven Le Scao, Angela Fan, Christopher Akiki, El- lie Pavlick, Suzana Ilic´, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b- parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100. Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Joel Hestness, Natalia Vassilieva, Daria Soboleva, and Eric Xing. 2023. Slimpajama-dc: Un- derstanding data combinations for llm training. arXiv preprint arXiv:2309.10818. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan- zaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053. Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Ja- cob R Steeves, Joel Hestness, and Nolan Dey. 2023. Slimpajama: A 627b token cleaned and deduplicated version of redpajama. Jingyuan Sun, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2020. Distill and replay for con- tinual language learning. In Proceedings of the 28th international conference on computational linguis- tics, pages 3569–3579. Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and effi- cient foundation language models. arXiv preprint arXiv:2302.13971. Hugo Touvron, Louis Martin, Kevin Stone, Peter Al- bert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, An- thony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,

Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Di- ana Liskovich, Yinghai Lu, Yuning Mao, Xavier Mar- tinet, Todor Mihaylov, Pushkar Mishra, Igor Moly- bog, Yixin Nie, Andrew Poulton, Jeremy Reizen- stein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subrama- nian, Xiaoqing Ellen Tan, Binh Tang, Ross Tay- lor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Ro- driguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. Don Tuggener, Pius Von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. Ledgar: A large-scale multi- label corpus for text classification of legal provisions in contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1235– 1241. Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. 2023. Zephyr: Di- rect distillation of lm alignment. arXiv preprint arXiv:2310.16944. Leandro von Werra, Younes Belkada, Lewis Tun- stall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. 2020. Trl: Trans- former reinforcement learning. https://github. com/huggingface/trl. Thuy-Trang Vu, Dinh Phung, and Gholamreza Haf- fari. 2020. Effective unsupervised domain adaptation with adversarially trained language models. arXiv preprint arXiv:2010.01739. Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. 2023a. How far can camels go? exploring the state of instruction tuning on open re- sources. arXiv preprint arXiv:2306.04751. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-instruct: Aligning language models with self-generated instructions. Yizhong Wang, Swaroop Mishra, Pegah Alipoor- molabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022. Super-naturalinstructions: Generalization via declar- ative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705. Debora Weber-Wulff, Alla Anohina-Naumeca, Sonja Bjelobaba, Tomáš Folty`nek, Jean Guerrero-Dib, Olu- mide Popoola, Petr Šigut, and Lorna Waddington. 2023. Testing of detection tools for ai-generated text. International Journal for Educational Integrity, 19(1):26.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, An- drew M Dai, and Quoc V Le. 2021. Finetuned lan- guage models are zero-shot learners. arXiv preprint arXiv:2109.01652. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of- the-art natural language processing. arXiv preprint arXiv:1910.03771. Minghao Wu, Thuy-Trang Vu, Lizhen Qu, George Fos- ter, and Gholamreza Haffari. 2024. Adapting large language models for document-level machine trans- lation. arXiv preprint arXiv:2401.06468. Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, and Maosong Sun. 2021. Lawformer: A pre-trained language model for chinese legal long documents. AI Open, 2:79–84. Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2023. A paradigm shift in machine translation: Boosting translation perfor- mance of large language models. arXiv preprint arXiv:2309.11674. Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong, and Furu Wei. 2021. Adapt-and-distill: Developing small, fast and effective pretrained language models for domains. arXiv preprint arXiv:2106.13474. Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhen- guo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical ques- tions for large language models. arXiv preprint arXiv:2309.12284. Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher De- wan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068. Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2023. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.

References

André F. T. Martins, Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Fabrizio Esposito, Sofia Morgado, Michael Desa, and Vera Lúcia Raposo. 2024. "SaulLM-7B: A Pioneering Large Language Model for Law." doi:10.48550/arXiv.2403.03883