Problematic Stanford GenAI Study Takes Aim at Thomson Reuters + LexisNexis

A controversial Stanford University academic paper has taken aim at Thomson Reuters’ and LexisNexis’ generative AI capabilities. However, problems with the study have already emerged, such as querying the wrong data set for case law results. Artificial Lawyer takes a look.

The study, ‘Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools’, conducted by the Institute for Human-Centered AI, or ‘HAI’, within the esteemed Stanford University, claims that:

‘Overall hallucination rates are similar between Lexis+ AI and Thomson Reuters’s Ask Practical Law AI … but these top-line results obscure dramatic differences in responsiveness. As shown in Figure 4 (see below), Lexis+ AI provides accurate (i.e., correct and grounded) responses on 65% of queries, while Ask Practical Law AI refuses to answer queries 62% of the time and responds accurately just 18% of the time. When looking solely at responsive answers, Thomson Reuters’s system hallucinates at a similar rate to GPT-4, and more than twice as often as Lexis+ AI.’

However, they then note:

‘Some of these disparities can be explained by the Thomson Reuters system’s more limited universe of documents. Rather than connecting its retrieval system to the general body of law (including cases, statutes, and regulations), Ask Practical Law AI draws solely from articles about legal practice written by its in-house team of lawyers. This limits the system’s coverage, as Practical Law documents cover a small fraction of legal topics.’

Meanwhile, well-known legal tech expert Greg Lambert, Chief Knowledge Services Officer at US firm Jackson Walker and co-founder of the 3 Geeks and a Law Blog, underlined the problem.

He stated on social media, referring to a test question the researchers had tried and then singled out as an example of genAI’s failings (see below):

‘If you’re going to write a paper about how the legal research AI tool hallucinates, please use the ACTUAL LEGAL RESEARCH TOOL WESTLAW, not the practicing advisor tool Practical Law [of Thomson Reuters].

‘Just to make sure, I did run the same question in Westlaw’s AI legal research tool, and it got it correct that Ginsburg did not dissent in Obergefell.

‘I would hope the authors of this Stanford Institute for Human-Centered Artificial Intelligence (HAI) paper go back and redo the benchmarking on the tools before this preprint study gets released.’

In short, the academic researchers queried the wrong body of information and so got wonky responses. And, as Lambert states, when he ran the same query through Westlaw, i.e. Thomson Reuters’ main case law research platform, the system did spot the error and give the right answer.

That then raises questions about how they approached LexisNexis as well. We need some legal research experts to do this again, that’s for sure.

One other point that struck this site: is running deliberately misleading questions that are designed to fail really a sound way of testing generative AI tools for lawyers? It’s worth mentioning that these ‘fake questions’ were only part of the queries the researchers tried out. Nevertheless, in the example noticed by Lambert the academics had intentionally sought to confuse the AI.

Clearly lawyers will make mistakes in their questions from time to time, but generally they know roughly what they are looking for, or at least aiming at. How often do lawyers type in queries built on a false premise, which cannot do anything but fail? 1% of the time? 20% of the time? More? It’s not clear.

The group also doubts the usefulness of RAG – Retrieval Augmented Generation, i.e. where relevant documents are first retrieved from a trusted database and the genAI system is then asked to ground its answer in those sources, rather than making things up. This can include citing the specific source used, so the reader can check the response’s validity.
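For readers who want to see the mechanism, here is a minimal sketch of the RAG loop in Python: retrieve a handful of source documents, then constrain the model to answer only from them and cite what it used. The tiny corpus, the keyword-overlap ranking and the stubbed model call below are illustrative assumptions only – no vendor’s actual pipeline works this simply.

# Minimal sketch of a Retrieval Augmented Generation (RAG) loop.
# The corpus, scoring rule and `call_llm` stub are illustrative placeholders.

from typing import List, Tuple

CORPUS = [
    ("Obergefell v. Hodges, 576 U.S. 644 (2015)",
     "Justice Ginsburg joined the majority opinion; she did not dissent."),
    ("Practice note: Same-sex marriage",
     "Summary of Obergefell and related guidance for practitioners."),
]

def retrieve(query: str, k: int = 1) -> List[Tuple[str, str]]:
    """Rank documents by naive keyword overlap and return the top k."""
    q_terms = set(query.lower().split())
    scored = [
        (len(q_terms & set((title + " " + text).lower().split())), (title, text))
        for title, text in CORPUS
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def build_prompt(query: str, sources: List[Tuple[str, str]]) -> str:
    """Ground the model: answer only from the retrieved sources, with citations."""
    context = "\n".join(f"[{title}] {text}" for title, text in sources)
    return (
        "Answer using ONLY the sources below and cite the source title. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for the generative model call."""
    return "(model response would appear here, grounded in the cited source)"

query = "Did Justice Ginsburg dissent in Obergefell?"
print(call_llm(build_prompt(query, retrieve(query))))

As the sketch suggests, the grounding is only as good as what gets retrieved – which is exactly where the study, and Lambert, locate the problem.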

(And it’s true that RAG is not 100% perfect, at least based on other papers that AL has seen.)

For example, they state:

‘There are several reasons that RAG is unlikely to fully solve the hallucination problem (Barnett et al., 2024). Here, we highlight some that are unique to the legal domain.

‘First, retrieval is particularly challenging in law. Many popular LLM benchmarking datasets (Rajpurkar et al., 2016; Yang et al., 2018) contain questions with clear, unambiguous references that address the question in the source database. Legal queries, however, often do not admit a single, clear-cut answer (Mik, 2024). In a common law system, case law is created over time by judges writing opinions; this precedent then builds on precedent in the way that a chain novel might be written in seriatim (Dworkin, 1986). By construction, these legal opinions are not atomic facts; indeed, on some views, the law is an “essentially contested” concept (Waldron, 2002).

‘Thus, deciding what to retrieve can be challenging in a legal setting. At best, a RAG system must be able to locate information from multiple sources across time and place in order to properly answer a query. And at worst, there may be no set of available documents that definitively answers the query, if the question presented is novel or indeterminate.’

And also:

‘Second, document relevance in the legal context is not based on text alone. [And] Third, the generation of meaningful legal text is also far from straightforward. Legal documents are generally written for other lawyers immersed in the same issue, and they rely on an immense amount of background knowledge to properly understand and apply.’

And those are fair points.

However, Lambert’s point remains equally valid: if you are testing the quality of a RAG-based genAI system for case law, then you need to measure the performance of the tool actually designed for case law.

As to the assessment of the Lexis+ AI system, this site does not claim expertise in conducting case law searches. Given the failings in how the study approached Thomson Reuters’ offering, AL will have to leave judgment to those who are experts in legal research to double-check the results and, in effect, conduct a parallel study.

Conclusion:

The study was clearly driven by good intentions, but overall, and in particular in its public statement titled ‘AI on Trial: Legal Models Hallucinate in 1 out of 6 Queries’, it feels like the researchers are jumping to conclusions a little too quickly. And that’s a shame, as this is important work that needs to be done. We need to share more factual information about how genAI works and how it can help lawyers. But it needs to be valid.

AL is asking various parties for a comment and will add them to this piece as they come in.

Here is the main paper so you can have a look for yourself – LINK.