Stanford HAI Tests Westlaw, But The GenAI Results Look Worse!

OK, this story is getting into unusual territory now. Artificial Lawyer just got an email from the spokespeople for the Stanford University HAI team, who told this site the researchers had updated their genAI study of hallucinations in case law tools to include Thomson Reuters’ Westlaw. And guess what? Westlaw has come out even worse than Practical Law did in the earlier tests (see below), according to the updated paper.

Here is the new statement to AL from HAI: ‘Letting you know that the research and blog post have been updated with new findings. The study now includes an analysis of Westlaw’s AI-Assisted Research alongside Lexis+ AI and Ask Practical Law AI.’

They have updated the HAI group’s findings here to reflect this.

As you may remember, this whole thing started when a group of researchers tested whether LexisNexis’s and Thomson Reuters’ genAI tools were as good as hoped for case law research. There was plenty of confusion when the team tested Practical Law, rather than Westlaw, for the case law questions. They have since been given access to Westlaw, hence the new results.

In the updated results (see the table below, Figure 4), the genAI study of case law questions saw Practical Law return 17% hallucinations, BUT the newly tested Westlaw got 33% hallucinations overall (left side table). I.e. the results look worse now than before once Westlaw, which is meant to be the main case law database, is included.

Updated HAI results, May 30, with Westlaw added.

And here are the original results, without Westlaw from last week.

Results from HAI last week, without TR’s Westlaw. Where it says ‘Thomson Reuters’, this refers to the queries tested on Practical Law.

Here is the link to the original story in Artificial Lawyer, and there are two more articles with comments that follow it that give more context – please see the AL site.

It’s late here in the UK, so AL is going to park the results and the updated research paper here and you can see if it makes sense. (Link to updated paper is above.) AL has not had time to get a response from TR yet, but they are very welcome to send one in and it will be added to this piece.