We Need To Talk About GenAI Accuracy

Generative AI is without doubt the most powerful technology to be applied to the legal sector since the arrival of digital tools, yet it faces a serious challenge: accuracy. If legal genAI is to become all it can be, this must be confronted.

Why It Matters

As David Wang, Head of Innovation at Wilson Sonsini, set out at Legal Innovators California earlier this month, the goal of leveraging AI to drive efficiency can only be achieved when the outputs are accurate. Otherwise those outputs undermine the entire effort. Whether the task is case law research or doc review, incorrect responses do more than not help: they slow things down. In other words, genAI at its worst could make lawyers slower in their work.

And we have been here before. Not long after starting Artificial Lawyer in 2016, this site visited a global law firm and spoke to a partner there who had a sideline in legal technology. He announced that the firm’s senior lawyers were not going to use AI tools for contract review as they existed then, tools which relied on machine learning and plenty of specific training for each document set that was analysed.

When the question of ‘why?’ arose, the partner replied: ‘Well, AI tools don’t work.’

To say the least, this site was taken aback, because all the proof at the time showed that they did work – not perfectly, but they did work ‘well enough’. We were clearly defining ‘works’ in different ways. From the partner’s point of view an AI tool had to ‘just get it right’ first time, with no doubts at all. It should not need training on each matter, nor should its outputs need checking, let alone re-checking. It either works perfectly or it doesn’t.

Of course, ‘old’ legal AI tools have been trained now over many years and perform well on tasks such as document review. But, with genAI we have run into the same challenge: accuracy and the burden of trust. And trust is paramount, because at the end of the legal production line is a client basing their actions on what a lawyer has told them, or on a contract a lawyer has negotiated for them.

And this has been brought into sharp focus by the fallout from the Stanford HAI study, which, although it has its own issues, highlighted that when it comes to accuracy and genAI tools we are in a right old mess (see here).

This has led some of the larger legal tech companies to go as far as discussing building a consortium to develop public, shared benchmarks for accuracy (see here). But that is really just the beginning.

The Mess We Are In

Here are some key considerations that outline the challenges we now face in tackling the genAI accuracy issue:

What are we measuring? When it came to the first wave of legal AI tools, focused on machine learning and natural language processing, a lot of the interest was in whether a tool could accurately locate a clause, e.g. a change of control clause, in a very long contract. Often, measures such as Recall and Precision were used, i.e. did the tool find everything it should have found (recall), and was what it found actually relevant (precision)? But, as Tim Pullan, CEO of ThoughtRiver pointed out, even if a company says its results are ‘90%’ accurate, they can be ‘hiding shockingly low precision stats, sometimes as low as 60% (i.e. 4 out of 10 things found are not relevant/incorrect). Hallucination is by its nature a precision problem.’ Moreover, because genAI is creative – which is one of its strengths, but also a weakness – getting an answer absolutely ‘spot on’, and consistently so, can sometimes be difficult.
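To make the recall/precision distinction concrete, here is a minimal, purely illustrative sketch. The clause IDs and counts are hypothetical, not taken from any real tool or benchmark; it simply shows how a tool can report 90% recall while hiding 60% precision, the kind of gap Pullan describes.

```python
# Illustrative only: a toy precision/recall calculation for a clause-retrieval tool.
# The clause IDs and results below are hypothetical, not from any real benchmark.

def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = share of retrieved items that are relevant.
    Recall = share of relevant items that were retrieved."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical contract review: 10 genuine change-of-control clauses exist;
# the tool returns 15 hits, 9 of which are real and 6 of which are noise.
relevant_clauses = {f"clause_{i}" for i in range(10)}
retrieved_clauses = {f"clause_{i}" for i in range(9)} | {f"noise_{i}" for i in range(6)}

p, r = precision_recall(retrieved_clauses, relevant_clauses)
print(f"precision={p:.0%}, recall={r:.0%}")  # precision=60%, recall=90%
```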

In which case, the question then is: what are we measuring for? That answers are returned that are ‘probably right’? Or that really are right? How do we separate out correct, accurate answers from hallucinations? Should a ‘surreal’ response (i.e. one that includes made-up information) that is still partially useful be judged in the same way as a partial response that is factually right but misses key information? That is to say, if we are no longer in a world of clear-cut precision, is ‘answer usefulness’ a measure on its own? And if so, how does that get measured?

Different levels of accuracy for different needs. GenAI can be used in many ways in a textual environment, and the level of accuracy we need changes with each use.

Fundamentally, while drafting a sentence may have a subjective aspect, a legal fact, such as a case citation, is not a subjective phenomenon; it’s a fact, pure and simple. For example, Lawyer A needs to check a citation of a past case, which they want to refer to in a legal document for a client.

Getting it wrong is seriously bad. Hallucinations that make up a case citation that nevertheless looks correct are even worse. Accuracy here matters 100%. Now take the example of the same lawyer drafting some overview commentary at the head of a document. This is not so much a factually based exercise as a stylistic one. Here, ‘accuracy’ is perhaps not even the word we are looking for. Perhaps ‘in keeping with house style’ is where we should focus? The same goes for writing a marketing email. You want the client’s name right and so on, but the bulk of the generative AI output is stylistic. Recall and precision don’t really help much.

In fact, nearly every aspect of what genAI can do changes the measurement system. Summarising a 100-page document? Well, how do we judge that? We don’t want massive errors in it, but how do we decide if it’s missed ‘something important’? Importance is a subjective issue.

Review and redrafting based on a playbook? There can be accuracy around correctly choosing text from the playbook that matches the specific legal clause. But, after that it is again subjective. How far do you want to expand that clause? How many caveats are needed? How do you decide that the redrafted clause is now ‘a good clause’? Can we just use ‘accuracy’ for that? Probably not on its own.

Should We Accept Inaccuracy?

Since the Stanford study there have been some comments in the legal sphere suggesting that accuracy doesn’t matter that much. One commentator made the point that genAI is like a paralegal, not a senior partner, and that in effect we should expect inaccuracy. We should expect to have to go back and check everything. But, this misses something essential: while those higher up in the ‘trust chain’ may go back and redraft what has been drafted, or add additional points that have been missed out – which happens in all hierarchical writing exercises, from the law, to journalism, to management consulting – they tend to rely on the information rising up to them as being factually correct.

For example, in journalism an editor will likely rewrite sections of a junior journalist’s article to ‘improve it’. They may remember an older news article that is relevant and insert a mention of it. But, they likely won’t go back and do the interview the junior journalist did all over again to be sure the facts are right. The editor, at some point, has to accept the junior journalist ‘got their facts right’. Otherwise the entire system would grind to a halt. And this is the challenge with genAI – it’s the base layer of truth that has to be relied upon more than anything else.

This site would argue that saying ‘don’t worry about accuracy, genAI is just like a paralegal and a senior lawyer can come back and check everything to be sure’ is not just reckless, it’s unworkable.

As noted, style and additional information can be added by those who are more senior and experienced. But, base facts tend to be trusted, because if you doubt everything the juniors have worked on, then what is the point in employing them? And in turn, if those juniors have relied on a genAI system that cannot be trusted, then we are in a real mess.

Where Do We Go From Here?

The move to create shared and public benchmarks is a welcome one. Companies such as Thomson Reuters and LexisNexis have both the resources and a common interest in getting this done.

However, they will need to take into account the variations in need outlined above. And this will only work if everyone uses those benchmarks. In 2016, this site asked several AI companies if they would do a ‘bake off’ together, to show where they stood on accuracy. None of the major ones would do it. The reason: fear that a rival would look more accurate than they did.

Of course, if TR and Lexis and some other large legal tech companies can set a standard, then other companies may feel compelled to join in, because if they refuse they will not look so good in the eyes of the market. That said, there are one or two legal tech companies in the market right now that hardly ever even speak to the press (hi Harvey!), let alone would want to be publicly transparent about their performance levels.

To conclude, this is not going to be an easy task. But, it will be an essential one. As we move from the initial amazement at what genAI can do in relation to legal’s ‘low hanging fruit’, and seek to integrate it at scale throughout a large part of the legal production line, we need to be confident as a sector in its accuracy and how we measure that.

Common, shared standards are therefore going to be essential. The hard part is going to be agreeing what these should be and how they should be applied to all the things that legal genAI can do.

But, we will have to do this if genAI is to be all it can be in the legal world.

By Richard Tromans, Founder, Artificial Lawyer, June 2024