AG Proves LLMs Can Do Well On Due Diligence

UK-based law firm Addleshaw Goddard (AG) has tackled the challenge of using generative AI for large-scale tasks such as due diligence and found that LLMs can deliver high accuracy… if you use the right methodology. Because of the sheer scale of such matters, many legal tech experts have tended to rely on older ML/NLP techniques that are proven to work. So, AG’s project is very illuminating.

In a long and detailed report they show exactly how they tested various methods to find out what could be done. For example, they tried different LLMs and different approaches to ‘chunking’, i.e. taking limited blocks of text from documents to improve accuracy, alongside other techniques such as sophisticated prompting.

As they explain: ‘This comprehensive research report, which we believe is the first of its kind from a law firm, sets out the work carried out by our Innovation Group to develop and test a robust method of using Large Language Models to review documents in the context of an M&A transaction legal due diligence project.’

After multiple tests, AG found that: ‘Our testing has shown that, through optimised retrieval techniques and improved prompting approaches, we can increase the accuracy of LLMs in commercial contract reviews from 74% to 95%, on average.’

And that: ‘It is possible to optimise LLMs using a range of components to increase performance – in some cases by quite a margin. Our findings have highlighted the importance of Prompt Engineering, the use of follow-up prompts and the careful process of optimising retrieval components in increasing LLM performance.’

They added that they ‘wanted … to go beyond what we had built with AGPT [their internal LLM-connected application] and be more than a simple wrapper for an LLM. We pursued this project ourselves in order to find a solution that balances flexibility, control, reliability and cost effectiveness. The development of our PoC and the findings along the way is a significant milestone on this journey.’

AG also highlighted that how you handle chunking is central to better outcomes, especially for a high-volume task such as due diligence.

For example: ‘Combining a good Chunking Strategy with other retrieval components resulted in a clear accuracy improvement of between 14% and 22%.’
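To make the idea concrete, here is a minimal sketch of fixed-size, overlapping chunking in Python. It is a generic approximation of the approach the report labels ‘Chunking Strategy 2’, not AG’s actual implementation, and real pipelines would typically also respect clause and paragraph boundaries:

```python
def chunk_text(text: str, chunk_size: int = 3500, overlap: int = 700) -> list[str]:
    """Split a document into fixed-size character chunks with overlap.

    A rough approximation of the 'chunks 3,500 characters long, with an
    overlap of 700 characters either side' strategy quoted in the report.
    """
    chunks = []
    step = chunk_size - overlap  # each chunk repeats the last 700 characters of the previous one
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

The overlap matters because a clause that happens to straddle a chunk boundary would otherwise be cut in half and potentially missed at retrieval time.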

However, prompting could both help and hinder things: ‘We discovered that giving LLMs a more detailed and bespoke message does improve the quality of its responses, but being too granular did have a detrimental effect.’
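To illustrate the level of detail in question, a provision-specific prompt pitched somewhere in the middle might look like the sketch below. The wording is purely illustrative and is not taken from AG’s Provision Specific Prompts; the point is that it tells the model what to look for without trying to enumerate every drafting variation:

```python
# Illustrative only: a provision-specific prompt at a moderate level of detail.
# Per the finding quoted above, bespoke detail helps, but over-specifying every
# edge case ("too granular") can start to hurt accuracy.
CHANGE_OF_CONTROL_PROMPT = """\
You are reviewing extracts from a commercial contract as part of M&A legal due diligence.
Identify any change of control provision. State:
1. Whether a change of control clause appears in the extracts provided.
2. What rights it gives the counterparty (e.g. termination, consent, notification).
3. The exact clause wording you relied on, quoted verbatim.
If the extracts contain no relevant wording, say so explicitly rather than guessing.
"""
```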

The firm also noted: ‘Humanising the information given to the models also showed some performance improvement – asking a model to pay extra attention and accusing it of missing relevant information both led to higher accuracy – with improvements of up to 16% in our experiments.’

(On a side note, another law firm this site visited recently has found that offering a ‘tip’ or ‘bribe’ of around £100 within the prompt also leads to better LLM performance.)
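The report’s exact follow-up wording isn’t reproduced here, but the pattern it describes (asking the model to pay special attention, then pushing back as if it has missed something) can be sketched roughly as follows. The template text and helper name are illustrative:

```python
# Illustrative sketch of the follow-up prompt pattern described above: after the
# model's first answer, challenge it and ask it to re-check the retrieved extracts.
FOLLOW_UP_TEMPLATE = (
    "Pay special attention to {aspect}. "
    "You appear to have missed relevant information in the extracts provided. "
    "Re-read them carefully and revise your answer, quoting the exact wording you rely on."
)

def with_follow_up(messages: list[dict], first_answer: str, aspect: str) -> list[dict]:
    """Append the model's first answer plus a challenging follow-up user message."""
    return messages + [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": FOLLOW_UP_TEMPLATE.format(aspect=aspect)},
    ]
```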

AG also found that: ‘A RAG approach has major advantages over a Full In-Context configuration as you can accurately feed the LLM with the minimum amount of context needed.’

I.e. rather than pasting an entire agreement into the model’s context window, it’s better to feed it small, targeted pieces of text and then check that the answers coming back are accurate.
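As a rough sketch of the difference, a RAG pipeline scores the chunks against the question and passes only the best few to the model, rather than the whole document. The naive keyword-overlap scoring below is only a stand-in for AG’s hybrid retrieval (advanced keyword search combined with vector search):

```python
def top_k_chunks(chunks: list[str], query: str, k: int = 10) -> list[str]:
    """Keep only the k chunks most relevant to the query.

    Naive keyword-overlap scoring stands in for AG's hybrid keyword-plus-vector
    retrieval. The selected chunks are returned in original document order,
    mirroring the report's approach of feeding the top 10 chunks back to the
    LLM in the order they appeared in the document.
    """
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(chunk.lower().split())), position, chunk)
        for position, chunk in enumerate(chunks)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    best = sorted(scored[:k], key=lambda item: item[1])  # restore document order
    return [chunk for _, _, chunk in best]
```

The minimum-context point is the important one: for each provision the model only ever sees the handful of extracts that matter, not the full agreement.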

In summary, the firm stated that it found the following (a rough code sketch of this configuration appears after the list below):

‘Following our testing, we found the best performing configuration to be as follows:

1. Using Chunking Strategy 2 (see report) – ‘chunks 3,500 characters long, with an overlap of 700 characters either side’.

2. Implementing a customised hybrid Retrieval Method combining both advanced keywords search and vector search.

3. Creating an optimised Advanced Keywords Query and Vector Search Query for each Provision.

4. Retrieving the Top 10 Chunks and feeding them back to an LLM in the order they appeared in the document.

5. Using GPT4-32K as the LLM for the task.

6. Setting the LLM Parameters as temperature 0, maximum tokens to 2,000, and a constant ‘seed’ value.

7. Drafting a targeted System Prompt that did not unduly increase the context length fed to the LLM.

8. Creating Provision Specific Prompts, improved by our findings in this research, that direct the LLM towards what it should be doing.

9. Employing a Follow Up Prompt asking the model to pay special attention to certain aspects and directly accusing it of missing information where necessary.’

The firm concluded: ‘We aimed to provide concrete examples and give some context to the rhetoric in the market in relation to the use of LLMs for legal work. While we still have a long way to go before we can create some of the solutions we want, we are already seeing a lot of value from GenAI in the work we do every day. We would welcome any comments and feedback following this paper and hope that sharing this approach drives a wider discussion across law firms, legal service providers, in-house teams, legal tech solution providers and academics.’
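Pulling those pieces together, here is a minimal sketch of what a single provision review in that configuration might look like, assuming an OpenAI-style chat completions API via the standard Python SDK. AG built this inside their own proof of concept, so model access, endpoints and the actual prompts will differ; the seed value and function name here are illustrative:

```python
from openai import OpenAI  # assumes the standard OpenAI Python SDK; AG's own stack may differ

client = OpenAI()

def review_provision(system_prompt: str, provision_prompt: str, context_chunks: list[str]) -> str:
    """Review one provision using settings mirroring the configuration listed above."""
    # Item 4: the top 10 retrieved chunks, joined in document order.
    context = "\n\n---\n\n".join(context_chunks)
    response = client.chat.completions.create(
        model="gpt-4-32k",   # item 5: GPT4-32K
        temperature=0,       # item 6
        max_tokens=2000,     # item 6
        seed=42,             # item 6: a constant seed value (42 is illustrative)
        messages=[
            {"role": "system", "content": system_prompt},  # item 7: targeted system prompt
            {
                "role": "user",
                # Item 8: provision-specific prompt, followed by the retrieved extracts.
                "content": f"{provision_prompt}\n\nContract extracts:\n{context}",
            },
        ],
    )
    return response.choices[0].message.content
```

A second call would then append the model’s first answer together with the challenging follow-up prompt from item 9, reusing the pattern sketched earlier.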

Is this a big deal? Yes, it is. Many, including this site, have seen due diligence as a real challenge for LLMs due to the huge volume of material and its complexity. LLMs do best on specific tasks that are clearly understood; the larger the task, the greater the risk that things go astray.

That matters, as several legal tech companies are using LLMs to help with due diligence, while many others in the market sell ML/NLP solutions.

In fact, AG added on that point: ‘ML extraction is still effective at finding and extracting clauses; however, a well optimised retrieval approach using LLMs is close to or on a par with this performance. There is an added advantage to using GenAI as it is possible to add new Provisions on the go, rather than labelling examples to run a supervised machine learning process, with the only overhead being the drafting of specific prompts.

‘We can get to an answer quicker using GenAI, but this is only a clear advantage for bespoke extractions as most of the solutions in the market have a large list of pre-trained concepts.

‘An additional benefit with GenAI is the ability to get to the next stage of querying the extractions to identify risks – this is the next focus of our research and our work to date has shown this to be effective. Using LLMs, we can create specific risk query prompts that get us to an answer, rather than just flagging the language a human would need to check.’
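That next step, querying the extraction for risk rather than just surfacing the clause, can be sketched in the same hedged way. The wording below is illustrative and not taken from AG’s prompts:

```python
# Illustrative risk-query prompt layered on top of an earlier extraction.
RISK_QUERY_TEMPLATE = (
    "Based on the extracted clause below, does the agreement allow the counterparty "
    "to terminate on a change of control of the target? Answer 'Yes', 'No' or 'Unclear', "
    "then explain your reasoning in two sentences and quote the wording relied on.\n\n"
    "Extracted clause:\n{extracted_clause}"
)

extracted_clause = "..."  # output of the extraction step sketched earlier
risk_prompt = RISK_QUERY_TEMPLATE.format(extracted_clause=extracted_clause)
```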

So, there you go. Great work by all of the Addleshaw Goddard team, and in particular Kerry Westland, Elliot White, Mike Kennedy, and Ron Raini.

The report can be found here.