Following the story on the controversial study ‘Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools’, conducted by the Human-Centred AI group or ‘HAI’, within Stanford University, the organisation’s spokespeople have sent a comment to Artificial Lawyer.
See the main story here, and responses from Thomson Reuters and LexisNexis here.
In particular, in their message, Stanford addressed how they came to use Thomson Reuters’s ‘Ask Practical Law AI’ to create a performance benchmark for the company’s GenAI research abilities, rather than focusing on GenAI outputs connected to the main Westlaw primary case law collection.
Their comment to Artificial Lawyer is here (emphasis added):
‘The Stanford study acknowledges Thomson Reuters also offers a product called “AI-Assisted Research” that appears to have access to additional primary source material as well (Thomson Reuters, 2023). However, the research notes this product is not yet generally available, and multiple requests for access were denied by the company at the time the researchers conducted the evaluation.’
They don’t make any reference to LexisNexis in their message to this site. So first, let’s take a look at the bit in the research that refers to the choice of what to study within Thomson Reuters.
Below is a section of the research paper from page 9, with emphasis added by AL, which relates to their request for access:
‘Thomson Reuters’s Ask Practical Law AI, offered on the Westlaw platform, is a more limited product, but it operates in a similar way. Like Lexis+ AI, Ask Practical Law AI also functions as a chatbot, allowing the user to input their queries in natural language and responding to them in the same format.
However, instead of accessing all the primary sources that Lexis+ AI uses, Ask Practical Law AI only retrieves information from Thomson Reuters’s database of “practical law” documents— “expert resources . . . that have been created and curated by more than 650 bar-admitted attorney editors” (Thomson Reuters, 2024).
Thomson Reuters markets this database for general legal research: “Practical Law provides trusted, up-to-date legal know-how across all major practice areas to help attorneys deliver accurate answers quickly and confidently.” Performing RAG on these materials, Thomson Reuters claims, ensures that its system “only returns information from [this] universe” (Thomson Reuters, 2024), reducing “hallucinations to nearly zero” (Ambrogi, 2024).
Thomson Reuters also offers a product called “AI-Assisted Research” that appears to have access to additional primary source material as well (Thomson Reuters, 2023).
However, this product is not yet generally available, and multiple requests for access were denied by the company at the time we conducted the evaluation. Both products are made available via the Westlaw platform and are commonly also referred to as AI products within Westlaw.’
So, to recap, Stanford’s HAI team is saying they know there is a difference between the Practical Law product and what is called ‘“AI-Assisted Research” that appears to have access to additional primary source material as well’.
They asked Thomson Reuters for access to this tool ‘multiple’ times and were denied access. So, if they wanted to study Thomson Reuters’s GenAI research capabilities they would have to do it without ‘access to additional primary source material’ – which is a challenge if you are asking case law questions.
Thomson Reuters also referred to this in their comment to this site yesterday, when they added: ‘To help the team at Stanford develop the next phase of its research, we have now made this product available to them.’ QED – they had indeed not given the researchers full access before.
Also, footnote 12 on the same page, attached to the last line of the passage above (‘Both products are made available via the Westlaw platform and are commonly also referred to as AI products within Westlaw’), reads: ‘For transparency, we made this clarification upon request by Westlaw and Thomson Reuters representatives, who consider Westlaw and Practical Law distinct product lines.’
In other words, the researchers were not just denied access to the product they wanted to explore at the time; it seems they were also in contact with Thomson Reuters at that point on aspects such as how to define its products.
So, in terms of what this means, there are two key points to consider:
- Did the researchers understand that what they were testing was something that was perhaps not the best fit for the types of case law questions they were asking?
- If what they wanted to study was not ‘generally available’ and they had been denied access, could they have waited until they could get access before doing the tests and putting their results out?
This site has a lot of sympathy for wanting to press ahead. No-one likes to be told they can’t access information or products. But clearly, it had an effect on the results.
As the table below shows, the Thomson Reuters and LexisNexis results are very different. Anyone who knows the two legal data giants well, with each so heavily focused on product development, might find it quite unusual for them to be so far apart on outputs. Did the very different results raise a red flag at the time of the research? E.g. the results below state that Thomson Reuters was only ‘accurate’ 19% of the time. It’s hard to think that such a major company would put out a product that had such weak results. And of course, in reality, it didn’t: the tool tested simply didn’t fit the questions asked.
With regard to LexisNexis, the university’s spokespeople have made no comment to Artificial Lawyer. However, it’s worth mentioning that yesterday, Jeff Pfeifer, Chief Product Officer for LexisNexis North America and UK, said:
‘LexisNexis has not been contacted by Stanford’s Daniel Ho [from the group that did the study], and our own data analysis suggests a much lower rate of hallucination.’
AL went back to the footnotes on page 9 of the research paper and found something else. Footnote 10 of the paper states:
‘Since the completion of our evaluation for this paper in April 2024, LexisNexis has released a “second generation” version of its tool. Our results do not speak to the performance of this second generation product, if different. Accompanying this release, LexisNexis noted, “our promise is not perfection, but that all linked legal citations are hallucination-free” (LexisNexis, 2024).’
Now, would it be fair to assume that a ‘second generation’ AI tool may be better than an earlier version? Probably. However, the academics could not wait forever for some perfect version to eventually arrive. They had to test what was there in front of them and then give their results based on what they saw.
This, however, raises a tricky issue with all types of tech product review: iteration and improvement. For example, GPT-3.5 is less powerful than GPT-4. In a product area where improvement is very rapid, a benchmark study done on an ‘old’ product, or even on one that has simply not benefited from six months of updates, will not reflect the current reality.
And Pfeifer highlights this in his reply to this site:
‘LexisNexis has extensive programs and system measures in place to improve the accuracy of responses over time, including the validation of citing authority references to mitigate hallucination risks in our product. ….. The solution is continually improving with hundreds of thousands of rated answer samples by LexisNexis legal subject matter experts used for model tuning.’
But, as noted, it’s impossible for anyone to review anything if they keep having to wait until it’s fully improved. At some point a reviewer just has to get on with it, even if what they’re looking at may be better in a year, or even a few months.
Perhaps this is one thing that the researchers could have highlighted more, i.e. to say ‘the version of this product you now might be using may have better performance than the one we tested’? But then, we return to the ‘at some point you just have to take what’s there and give the results as they appear’ argument, even if it’s fair to assume that a GenAI-based product will keep improving and so the study will be ‘out of date’ quite quickly.
Conclusion
As mentioned previously, this was a well-intentioned study aiming to illuminate an area that matters to thousands of lawyers.
It’s clear that the researchers did ask for access to a product that, had they tested it instead, would probably have given different – and one assumes better – results. So, that was an unfortunate outcome for both the Stanford team and Thomson Reuters.
It perhaps underlines the need for a more community-based approach, where the goal is to get information out to the users even if there are perhaps some commercial sensitivities about sharing access to products and performance results.
With regard to LexisNexis, given that their products have already moved on since the study was conducted, it’s hard to know where those results stand now. Again, public and up-to-date performance data, shared across the sector, would help everyone.
When it comes to GenAI, the legal sector wants to know more than ever what works and what doesn’t, and to what degree. But, right now we still don’t know for sure as there is so little publicly shared, reliable and comparable data to go on when it comes to GenAI performance for lawyers.
Testing for accuracy and other output aspects of GenAI is also a cost for all law firms and in-house teams – and slows down adoption. But, right now, there is no alternative as there is no industry standard or verifiable knowledge base for this information.
This gap in the sector’s collective knowledge is clearly a huge opportunity for the building of a more collaborative and open approach.
—
[ Photo: Stanford University, California. By RT. ]