2023 TextMiningLegalDocumentsforClauseExtraction

From GM-RKB

Subject Headings: Contract Clause Extraction.

Notes

Cited By

Quotes

Abstract

Natural Language Processing (NLP) solutions for legal contracts have been the preserve of large law firms and other industries (e.g., investment banks), especially those with large amounts of resources, having both the volume and range of legal documents and the manpower to label the training data. The findings suggest that it is possible to use a smaller volume of training contracts and still generate results that are within an acceptable range. Our results show that a pre-trained language model trained on just 120 contracts can generate results that are within 10% of the same model trained on 3.3 times the volume. In conclusion, smaller law firms could benefit from machine learning NLP solutions for clause extraction.

Introduction

Legal documents are common in both the business and personal worlds, used to create a written, legally binding contract between two or more parties. While some of these can have very standard templates, such as Credit Card Terms and Conditions (which become a legal agreement once the card is used), others will be very bespoke, such as those for the building and running of a nuclear power station. The majority of existing legal documents have been created by lawyers in legal firms or by in-house lawyers (large firms typically have their own legal departments), written using a word processor, printed and physically signed. Historically, the legal wording has been unique to each contract, shaped by the writing style of the individual lawyer, which has resulted in a wide variety of clause texts for each legal clause [1].

Early software solutions to extract the legal clauses from the documents have been mainly rules-based, requiring specialised teams to review large volumes of documents to look for variations in each clause type and to write complex rules to extract these terms from other documents [2]. As machine learning and artificial intelligence have developed over the years, new software solutions have been developed for the legal industry. One of these technologies, Natural Language Processing (NLP), is becoming common in several software solutions [3]. These have been created to allow law firms to utilise electronic copies of their legal documents to find and extract the clause text, which can be used in activities such as Legal Research, Electronic Discovery, Contract Review and Document Automation [4].
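As an illustration of the kind of hand-written rule such early systems rely on, the following is a minimal sketch in Python; the clause names and regular expressions are assumptions made for this example, not patterns taken from the paper.

```python
import re

# Illustrative, hand-written patterns of the kind a rules-based system
# might use; the clause names and regexes are assumptions for this sketch.
CLAUSE_RULES = {
    "governing_law": re.compile(
        r"this agreement (?:shall be|is) governed by the laws? of ([A-Z][\w\s]+)",
        re.IGNORECASE),
    "termination": re.compile(
        r"either party may terminate this agreement (?:upon|with) (\d+) days'? (?:written )?notice",
        re.IGNORECASE),
}

def extract_clauses(contract_text: str) -> dict:
    """Return the first match for each rule, or None if a clause is not found."""
    hits = {}
    for clause, pattern in CLAUSE_RULES.items():
        match = pattern.search(contract_text)
        hits[clause] = match.group(0) if match else None
    return hits

sample = ("Either party may terminate this Agreement upon 30 days' written notice. "
          "This Agreement shall be governed by the laws of England and Wales.")
print(extract_clauses(sample))
```

Real systems need many such patterns per clause type to cover the variation in drafting styles, which is what makes the rules-based approach labour-intensive.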

In Fig. 1 (overall system operation of data throughput and transformations) we show the process of generating the training and testing documents; a minimal code sketch of this loop follows the list:

  1. Select some documents at random, e.g., 50, and find and label the required clauses in these documents
  2. Split the labelled documents into training/testing, e.g., 40/60%, and include the remaining unlabelled documents in the testing dataset
  3. Run training using the training dataset, i.e., the labelled documents
  4. Evaluate the testing metrics, only using the labelled portion of the testing dataset:
    1. If the metrics are low/unsatisfactory, then select another number of documents and, with the assistance of the predictions from the previous training/testing, label those clauses. Return to step 2.
    2. If the metrics are satisfactory, then the model is complete, so end process.
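The sketch below outlines this loop, assuming hypothetical label_fn, train_fn and eval_fn hooks; the 50-document batch size and 40/60 split follow the examples in the list, while the F1 stopping threshold is an illustrative assumption.

```python
import random

def iterative_labelling(documents, label_fn, train_fn, eval_fn,
                        batch_size=50, train_fraction=0.4, target_f1=0.8):
    """Sketch of the Fig. 1 loop: label a batch, split train/test, train,
    evaluate, and label more documents (aided by predictions) until the
    metric is satisfactory. label_fn/train_fn/eval_fn are hypothetical hooks."""
    unlabelled = list(documents)
    labelled = []
    model = None
    while unlabelled:
        # Step 1: pick a batch at random and label the required clauses,
        # using predictions from the previous model to speed up labelling.
        batch = random.sample(unlabelled, min(batch_size, len(unlabelled)))
        labelled += label_fn(batch, assist_model=model)
        unlabelled = [d for d in unlabelled if d not in batch]

        # Step 2: split the labelled documents into training/testing.
        random.shuffle(labelled)
        cut = int(len(labelled) * train_fraction)
        train_docs, test_docs = labelled[:cut], labelled[cut:]

        # Steps 3-4: train on the labelled training set and evaluate on the
        # labelled portion of the testing set.
        model = train_fn(train_docs)
        f1 = eval_fn(model, test_docs)
        if f1 >= target_f1:  # satisfactory: stop labelling and return the model
            return model
    return model
```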

Natural Language Processing (NLP) is being used in the legal services sector [5], covering the five main processes that legal firms are interested in, which are:

  • Legal Research: the process of finding relevant information from legal documents to support legal decision making [6].
  • Electronic Discovery: the process of finding relevant files, then finding information from them, which is used to support several use-cases
  • Contract Review: the process of reviewing and amending contracts [7]
  • Document Automation: the process of creating new legal documents, by utilising existing legal documents. [8]
  • Legal Advice: the process where legal advice is provided based on existing legal documents and laws

Term Extraction would be required for all of these activities, i.e., to find and extract the relevant legal text within the legal documents [9].

Several commercial products are available, some of which have been in existence for a long time, such as LexisNexis from the early 1970s, which has built a huge database of content (claimed to be 30 TB), Thomson Reuters and Bloomberg Law, all of which offer subscription-based services to access their content. However, newer legal tech firms are capturing market share by offering smarter technologies, such as Machine Learning and NLP, to improve the accuracy and precision of searches, utilising their clients' legal document repositories.

Related Work

Previous work has focused on measuring text similarity, proposing a method that combines different measurements, including sentence structure, word-to-word and word order similarities [10]. Sentence structure similarity involves parsing the sentence, which includes Part-of-Speech tagging, grammar tagging and Named Entity Recognition. The next step is to use the parsed and tagged sentence to generate a semantic representation graph, which captures the structural information of the sentence. Word-to-word similarity involves finding similarities between words in the two sentences being compared; the more similar the words, the higher the score. Word order similarity involves finding similar words in the same order.
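As a rough illustration of how word-to-word and word-order similarities can be combined, the following sketch uses simple token overlap and a rank-based word-order score; the weighting and the omission of the sentence-structure component are simplifications for this example, not the method of [10].

```python
def word_overlap_similarity(s1, s2):
    """Word-to-word similarity approximated by token overlap (Jaccard);
    the cited method uses richer word-level measures."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def word_order_similarity(s1, s2):
    """Word-order similarity: compare the order of shared words across the
    two sentences (1.0 when shared words appear in the same order)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    shared = [w for w in t1 if w in t2]
    if len(shared) < 2:
        return 1.0 if shared else 0.0
    ranks = [t2.index(w) for w in shared]  # positions of shared words in s2
    inversions = sum(1 for i in range(len(ranks))
                     for j in range(i + 1, len(ranks)) if ranks[i] > ranks[j])
    max_inv = len(ranks) * (len(ranks) - 1) / 2
    return 1.0 - inversions / max_inv

def combined_similarity(s1, s2, alpha=0.7):
    """Weighted combination; alpha is an illustrative weight and the
    sentence-structure component is omitted in this sketch."""
    return (alpha * word_overlap_similarity(s1, s2)
            + (1 - alpha) * word_order_similarity(s1, s2))

print(combined_similarity("the tenant shall pay rent monthly",
                          "rent shall be paid monthly by the tenant"))
```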

Key to the process is the extraction of the relevant text elements and how it can be automated for legal contracts [11]. This involved the creation of a labelled dataset of approximately 3,500 English contracts, which have been tagged with 11 types of elements (e.g., contract title, party, governing law). Their dataset has been encoded so that each word (token) is represented by an integer number (e.g., termination is represented by 3156), and any words not in the vocabulary are represented as UNK. Each token in the labelled dataset is then followed by an element tag. One of the reasons for their work is that contract element extraction is currently a mostly manual exercise, which can be tedious and expensive. They also consider Named Entity Recognition (NER) and how it relates to their research. One key point is that while NER can find certain entities (dates, amounts, etc.), it does not necessarily determine the type of date (e.g., contract start or termination date) or the type of amount (e.g., monthly rent payments or collateral fees).
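A minimal sketch of the kind of integer encoding described above, with out-of-vocabulary words mapped to UNK; the vocabulary and the ids it produces are illustrative and do not reproduce the cited dataset's encoding.

```python
UNK = "UNK"

def build_vocab(training_tokens):
    """Map each word (token) seen in training to an integer id; id 0 is
    reserved for out-of-vocabulary words, which are written as UNK."""
    vocab = {UNK: 0}
    for tok in training_tokens:
        vocab.setdefault(tok.lower(), len(vocab))
    return vocab

def encode(tokens, vocab):
    """Replace each token with its integer id, falling back to UNK."""
    return [vocab.get(tok.lower(), vocab[UNK]) for tok in tokens]

# Illustrative only: the actual ids in the cited dataset (e.g. 3156 for
# "termination") come from their own vocabulary, not from this sketch.
vocab = build_vocab("upon termination of this agreement the party shall".split())
print(encode("termination by either party".split(), vocab))
```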

A comparative study of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for deep learning in NLP was performed by Yin [12]. For the RNN they look at two variants, the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), whose gating mechanisms were developed to address some limitations of the RNN. CNNs can be considered hierarchical and RNNs sequential architectures. Other types of network that can be utilised for NLP include the Bidirectional RNN, the Deep (Bidirectional) RNN and Recursive Neural Networks (RCNN). A recent development is the Transformer, a deep learning model that uses self-attention mechanisms, a technique that mimics cognitive attention [13]. Transformers were designed and developed by a team of eight Google researchers working in Google Brain and Google Research. The Transformer dispenses with recurrent (RNN) and convolutional (CNN) neural networks and is based solely on attention mechanisms. Transformers are designed to process sequential input data, but they process the entire input at once, can provide context for any position in the input sequence, and can capture information about faraway tokens. Transformers are now used in a number of pre-trained language models, including the Generative Pre-trained Transformer (GPT-2, GPT-3), BERT, RoBERTa and ALBERT. These models are being used to perform a variety of NLP tasks such as language translation, named entity recognition, document generation, and question answering.
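As an illustration of how such a pre-trained Transformer might be applied to clause extraction framed as token classification, the following sketch uses the Hugging Face transformers library; the checkpoint, the BIO label set and the token-classification framing are assumptions for this example, not details taken from the paper, and the classification head is untrained until it is fine-tuned on labelled contracts.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed BIO label set for two illustrative clause types; the paper's
# actual labels and checkpoint may differ.
LABELS = ["O", "B-GOVERNING_LAW", "I-GOVERNING_LAW", "B-TERMINATION", "I-TERMINATION"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(LABELS))  # classifier head is randomly
                                                # initialised until fine-tuned

text = "This Agreement shall be governed by the laws of England."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits             # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)[0]

# Print each sub-word token with its predicted clause label.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, pred in zip(tokens, predictions):
    print(tok, LABELS[pred])
```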

References

Tony Vidler, Kenneth McGarry, and David Baglee. (2023). "Text Mining Legal Documents for Clause Extraction."