Thomson Reuters Contradicts Stanford GenAI Study – ‘We Are 90% Accurate’

Thomson Reuters has contradicted the findings of a recent Stanford HAI study into the genAI research capability within Westlaw Precision, stating that ‘our thorough internal testing of AI-Assisted Research shows an accuracy rate of approximately 90% based on how our customers use it’.

The claim, made in a piece written by Mike Dahn, head of Westlaw Product Management, came in response to the Stanford University HAI study in May, which found the results from TR’s Westlaw genAI product to be highly problematic (see below).

The results of the Stanford study show an accuracy rate for the Westlaw genAI product of only 42% and an overall hallucination rate of 33% – see here.

Stanford HAI data, May 2024.

Meanwhile, as noted, TR says they are at ‘90%’ accuracy, and they avoid discussing in detail the appearance of hallucinations in research results. But they do state in their response that they are open about errors occurring:

‘AI-Assisted Research uses large language models and can occasionally produce inaccuracies, so it should always be used as part of a research process in connection with additional research to fully understand the nuance of the issues and further improve accuracy.

‘We also advise our customers, both in the product and in training, to use AI-Assisted Research to accelerate thorough research, but not to use it as a replacement for thorough research.’

The company then goes on to thank the Stanford team and to welcome the idea of benchmarking genAI legal solutions.

However, they do not really back down at all on the key message: they are accurate, no matter what Stanford has said. Their statement then underlines how rigorously they test:

‘Last year, we released AI-Assisted Research in Westlaw Precision to help our customers do their legal research much faster and better…Prior to releasing it, and since its release, we test it rigorously with hundreds of real-world legal research questions, where two lawyers graded each result, and a third, more senior lawyer, resolved any disagreements in grading.

‘We also tested with customers prior to release – first explaining that AI solutions should be used as an accelerant, rather than a replacement for their own research. In our testing with customers their feedback was extraordinarily positive even when they found inaccuracies themselves. They would tell us things like, “this saves hours of time” or “it’s a game-changer.” We’ve never had a more exciting update to Westlaw.’

Overall, the TR statement is something of a defiant response, showing confidence in the product while balancing this with the point that they know errors appear and that AI is ‘an accelerant, rather than a replacement’. However, they remain bullish on the accuracy level: 90% in TR’s tests vs 42% in the Stanford test, which also found plenty of hallucinations.

What Does It All Mean?

Primarily, TR’s comments send the message to the market that, whatever Stanford says, it was wrong to draw the conclusions it did.

Not that TR had much choice here. How can a company that serves lawyers sell a product that is only ‘42% accurate’ and hallucinates a lot as well? But TR says it tests extensively with lawyers and that the accuracy is actually very high. So, it IS good to use.

Go figure. And perhaps only you – the users of the product – can ever really get to the answer, as you are the ones using it for real work. Although objective genAI benchmarks that the legal community agreed upon would go a long way to helping, especially before firms buy this software.

This site has plenty of additional questions for TR, which will be sent today in the hope of getting a fuller response. And it’s worth mentioning that when AL asked for a statement in May about Stanford’s Westlaw test, this site was sent this same general statement, plus a smaller comment that stemmed from it.

P.S. One point many people have wondered about – ‘How much of this is basically Casetext?’ – has still not been answered directly. However, it seems fairly clear that before the Casetext deal TR didn’t have extensive genAI capabilities for case law research, and after the deal it did. But AL will keep asking TR for clarity on this.

Conclusion

It’s understandable that TR wants to draw a line under this and move on. The problem is that the Stanford study’s findings are just so radically far apart from TR’s own. There are still several questions to answer, and AL will keep asking them.