
Construction and deployment of AI-based NLP applications for the analysis of legal texts

Karleen led them to a long, dungeonlike room with no windows, a concrete floor, poor lighting, and neat stacks of white cardboard boxes with the words “Placid Mortgage” stamped on the end. It was the mountain Kyle had heard so much about. The boxes, as Karleen explained, contained the files of all thirty-five thousand plaintiffs. Each file had to be reviewed.
...
“Someday in court,” she said gravely, “it will be crucial for our litigators to be able to tell the judge that we have examined every document in this case.”

John Grisham, “The Associate”

Introduction

Analysis of large volumes of legal texts is a common task in the practice of law firms. Handling it may require an enormous investment of time and effort; therefore, significant attention is being paid to automating it with intelligent software solutions. The recent introduction of natural language processing (NLP) technologies based on deep learning and transformers opens new opportunities for addressing this problem.

In this article we describe the early results of our project aimed at prototyping a transformer-based NLP solution for the analysis of large volumes of legal texts. We use the HuggingFace Transformers library as a source of deep learning models. We integrate these models into the application using Arhat, our own technology designed for the native deployment of deep learning inference on a wide range of computing platforms. Special attention is paid to targeting modern Intel CPUs and GPUs via the oneAPI Deep Neural Network Library (oneDNN).

Objectives

The main objective of this research project is the construction of a prototype solution that implements information retrieval from a selected corpus of legal texts using the question answering method. As the text corpus we use a small but representative part of Swiss private law. The solution takes user questions as input and retrieves sections of the analyzed text that contain the relevant information. Both the analyzed text and the user questions are formulated in German.

The project goals include:

  • obtaining a better understanding of the problem domain,
  • evaluating the transformer technology and its applicability to the task,
  • defining the reference solution architecture.

At this stage, we do not intend to train any models specifically for use with legal texts. Instead, we have chosen a suitable existing model trained on a general collection of German texts. We expect that the anticipated reduction in accuracy when processing legal texts will not prevent us from reaching the stated project goals.

Challenges

Transformer models impose rather tight limits on input sequence length; therefore, it is not possible to process more than the contents of one paragraph in a single inference run. Batching the input sequences can improve performance on some target platforms, but a method must nevertheless be developed to avoid passing the entire corpus of analyzed texts, paragraph by paragraph, to the inference engine. This method should provide a powerful filtering capability that excludes from processing the larger part of the text corpus that presumably contains no information relevant to the user question.
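To make the batching step concrete, the following Go sketch groups pre-tokenized paragraphs into fixed-size batches while skipping sequences that exceed the model limit. All names are hypothetical and serve illustration only; the 512-token limit is a typical value for BERT-style models, not necessarily the one used here.

package main

import "fmt"

// maxSeqLen is a typical input limit of BERT-style models; the actual
// value depends on the chosen model.
const maxSeqLen = 512

// batchParagraphs groups pre-tokenized paragraphs (token ID sequences)
// into batches of at most batchSize sequences. Paragraphs longer than
// maxSeqLen are skipped here; a real system would split or truncate them.
func batchParagraphs(paragraphs [][]int, batchSize int) [][][]int {
	var batches [][][]int
	var current [][]int
	for _, p := range paragraphs {
		if len(p) > maxSeqLen {
			continue
		}
		current = append(current, p)
		if len(current) == batchSize {
			batches = append(batches, current)
			current = nil
		}
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}

func main() {
	paragraphs := [][]int{{101, 2023, 102}, {101, 3231, 102}, {101, 999, 102}}
	fmt.Println(len(batchParagraphs(paragraphs, 2))) // prints 2
}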

It is known that the performance of deep learning language models degrades when they handle language varieties different from the one used for training. Legal language has distinct features such as domain-specific vocabulary and sentence structure. Ideally, the transformer model should be trained on a corpus of legal texts. Using a model trained on generally selected texts may lead to reduced performance, and some method should be developed to partially mitigate this.

We address these challenges by combining deep learning with some basic cognitive AI techniques. The Swiss law texts are extensively annotated with sets of hierarchically arranged phrases. We have developed a set of tools that analyze these annotations and extract the words representing the relevant legal terms. The result of this analysis is represented as a simple knowledge graph containing the extracted legal terms and the respective relations. Furthermore, the text corpus is fully indexed by these terms. The indexing procedure takes into account German morphology to properly handle the different forms of each term. The user question is also analyzed for occurrences of known terms. Paragraphs of the analyzed text that share no terms with the question are excluded from evaluation.
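The following Go sketch illustrates the filtering idea under stated assumptions: an inverted index maps normalized legal terms to paragraph identifiers, and the question has already been reduced to its normalized terms. The index type and helper names are hypothetical.

package main

import "fmt"

// termIndex maps a normalized legal term (lemma) to the identifiers of
// paragraphs in which any morphological form of the term occurs. In the
// real system this index is built offline from the annotated law texts.
type termIndex map[string][]int

// candidateParagraphs returns the paragraphs that share at least one
// known term with the user question; all remaining paragraphs are
// excluded from inference.
func candidateParagraphs(idx termIndex, questionTerms []string) []int {
	seen := make(map[int]bool)
	var result []int
	for _, term := range questionTerms {
		for _, para := range idx[term] {
			if !seen[para] {
				seen[para] = true
				result = append(result, para)
			}
		}
	}
	return result
}

func main() {
	idx := termIndex{
		"eigentum":  {3, 17},
		"grundbuch": {17, 42},
	}
	// Terms extracted from the question after morphological normalization,
	// e.g. "Eigentums" -> "eigentum".
	fmt.Println(candidateParagraphs(idx, []string{"eigentum", "grundbuch"}))
	// prints [3 17 42]
}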

Selecting a suitable pre-trained question answering model for German represents a separate challenge because few such models are publicly available at present. We experimented with several models and found that deutsche-telekom/electra-base-de-squad2 provides the most accurate results.

Deployment technology

Arhat is a cross-platform framework designed for the efficient deployment of deep learning inference workflows in the cloud and at the edge. Unlike conventional deep learning frameworks, Arhat translates neural network descriptions directly into lean standalone executable code.

Arhat addresses the engineering challenges of modern deep learning caused by the high fragmentation of its hardware and software ecosystems. This fragmentation makes the deployment of each deep learning model a substantial engineering project and often requires cumbersome software stacks.

Arhat automatically generates software code specific to the selected deep learning model and target platform. This results in very lean application code with minimal external dependencies. Deploying and maintaining such applications requires much less effort than conventional methods.

Arhat is tightly integrated with the Intel oneAPI deep learning ecosystem and supports Intel hardware via the oneDNN back-end. This back-end translates the model specification into C++ code that runs on top of a thin Arhat runtime interacting directly with oneDNN. This approach yields a very slim deployable software stack that can run on any Intel hardware supporting oneDNN.

We have successfully experimented with using Arhat to deploy models from the HuggingFace Transformers library on various target platforms. Arhat makes the deployment process easy and streamlined, without requiring PyTorch or any part of the original Transformers library. The models are converted to compact C++ code that is directly integrated with the native platform library of deep learning primitives, such as oneDNN.

Reference architecture

We implemented the solution prototype as a Web application accessible to the end user via a conventional Web browser. The application provides a simple user interface that allows the user to enter a question in natural language and returns a collection of text fragments containing the matching answers. The returned fragments are paragraphs of the analyzed text.

The reference architecture, as originally designed, is shown in the following figure.

The application included three components:

  • Engine
  • Tokenize microservice
  • Predict microservice

The Engine is the main component that runs behind the HTTP server. It implements all the question answering logic, extracts information from the application databases, orchestrates the invocation of microservices for specific tasks, and composes the response. The Engine is implemented in the Go programming language.
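A minimal Go sketch of this orchestration is shown below. The JSON contract between the Engine and the Predict microservice, the endpoint URL, and the helper functions are all assumptions made for illustration; they do not reproduce the actual implementation.

package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// predictRequest and predictResponse mirror a hypothetical JSON contract
// between the Engine and the Predict microservice.
type predictRequest struct {
	Question   []int   `json:"question"`
	Paragraphs [][]int `json:"paragraphs"`
}

type predictResponse struct {
	Scores []float64 `json:"scores"`
}

// askHandler sketches the Engine workflow: tokenize the question, select
// candidate paragraphs via the term index, and forward the batch to the
// Predict microservice.
func askHandler(w http.ResponseWriter, r *http.Request) {
	question := r.URL.Query().Get("q")

	body, _ := json.Marshal(predictRequest{
		Question:   tokenize(question),
		Paragraphs: lookupTokenized(candidateIDs(question)),
	})
	resp, err := http.Post("http://predict:8081/infer", "application/json", bytes.NewReader(body))
	if err != nil {
		http.Error(w, "predict service unavailable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	var pr predictResponse
	json.NewDecoder(resp.Body).Decode(&pr)
	json.NewEncoder(w).Encode(pr.Scores)
}

// Stubs standing in for the in-process WordPiece tokenizer, the term
// index filter, and the text corpus database.
func tokenize(s string) []int           { return []int{101, 102} }
func candidateIDs(q string) []int       { return []int{0} }
func lookupTokenized(ids []int) [][]int { return [][]int{{101, 102}} }

func main() {
	http.HandleFunc("/ask", askHandler)
	http.ListenAndServe(":8080", nil)
}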

The Tokenize microservice was originally introduced solely for the tokenization of user questions. (The analyzed texts are stored along with their pre-tokenized representations and therefore do not require tokenization at runtime.) The main reason for placing this functionality in a separate microservice was the choice of programming languages (Python and Rust) used by the authors of the Transformers library to implement the tokenization algorithms.

Keeping a separate microservice for such a small function did not look optimal. Therefore, soon after completing the first prototype, we reimplemented in Go the WordPiece tokenizer used by BERT and related models. We integrated the new tokenizer directly into the Engine, completely removing the Tokenize microservice. This upgrade led to a leaner architecture, made the stack of required programming languages more homogeneous by reducing it to just Go and C++, and fully eliminated the need to deploy any part of Transformers or PyTorch on the target platform.
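The core of the WordPiece algorithm is a greedy longest-match-first search over the model vocabulary. The following Go sketch shows this core step for a single word; it is a simplified illustration, not our production tokenizer (which also handles text normalization, punctuation splitting, and multi-byte characters).

package main

import (
	"fmt"
	"strings"
)

// wordPiece splits one lower-cased word into subword units using the
// greedy longest-match-first strategy of WordPiece. Continuation pieces
// carry the "##" prefix. Note: a complete implementation must iterate
// over runes, not bytes, to handle German umlauts correctly.
func wordPiece(word string, vocab map[string]bool) []string {
	var pieces []string
	start := 0
	for start < len(word) {
		end := len(word)
		cur := ""
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub
			}
			if vocab[sub] {
				cur = sub
				break
			}
			end--
		}
		if cur == "" {
			return []string{"[UNK]"} // no vocabulary entry matches
		}
		pieces = append(pieces, cur)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"grund": true, "##buch": true}
	fmt.Println(wordPiece(strings.ToLower("Grundbuch"), vocab))
	// prints [grund ##buch]
}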

The Predict microservice implements inference for the batched input generated by the Engine. It hosts the lean executable code generated by Arhat for the selected transformer model, together with the respective model weights. This component is implemented in C++ and does not require the deployment of any part of the original Transformers library. On Intel platforms, the oneDNN library is used for the efficient handling of deep learning operations.

The transformer model assigns a numeric confidence score to each analyzed fragment. The Engine selects the fragments with a score above a heuristically defined threshold and returns them to the user.
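In Go, this selection step amounts to a simple filter; the Fragment type and the helper name below are illustrative assumptions.

package answers

// Fragment pairs a paragraph of the analyzed text with the confidence
// score assigned by the transformer model.
type Fragment struct {
	Text  string
	Score float64
}

// selectAnswers keeps only the fragments whose score exceeds the
// heuristically defined threshold.
func selectAnswers(fragments []Fragment, threshold float64) []Fragment {
	var selected []Fragment
	for _, f := range fragments {
		if f.Score > threshold {
			selected = append(selected, f)
		}
	}
	return selected
}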

The Engine uses two databases containing the corpus of legal texts and the knowledge graph, respectively. The text corpus database stores both the natural German text and its tokenized representation. The knowledge graph database contains the legal terms, the relations between them, and pointers into the text corpus. The morphological information related to the terms is technically also part of this database.
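A possible Go data model for the knowledge graph database is sketched below; the field layout is our assumption for illustration and does not reproduce the actual schema.

package kb

// Term is a legal term extracted from the annotations of the law texts.
type Term struct {
	ID    int
	Lemma string   // normalized base form
	Forms []string // morphological variants used for indexing
}

// Relation links two terms in the annotation hierarchy,
// e.g. a broader/narrower relation.
type Relation struct {
	From, To int
	Kind     string
}

// Occurrence points from a term into the text corpus database.
type Occurrence struct {
	TermID      int
	ParagraphID int
}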

Conclusion

We have evaluated the implemented prototype using a representative set of questions. The application reliably retrieves the relevant information, although for some questions a significant number of false positives is returned. Nevertheless, its accuracy is already vastly superior to that of conventional full text search. This makes our approach promising and worth further research.

Using a simple knowledge graph of legal terms improves accuracy and yields good retrieval speed, but it does not fully eliminate false positives in cases where the relevant information is contained in the surrounding context rather than directly in the analyzed paragraph. As a future improvement, we plan to extend our automated analysis of annotations in the law texts and establish affinity relations between legal terms and larger regions of the text corpus. These relations can be taken into account when evaluating the score of each potential answer.