
Vals AI, the US-based company providing genAI performance testing, has published its first study of how several legal tech companies responded to a series of tests set for them by major law firms, including Reed Smith and Fisher Phillips.
The companies whose results are shared include Harvey, Thomson Reuters' CoCounsel, Vecflow, and vLex. The human lawyers who acted as a fleshy comparator were provided by ALSP Cognia. (Note: for some more context, there are AL TV Product Walk Throughs of Vecflow and vLex – here.)
Vals’ proprietary ‘auto-evaluation framework platform’ was then used ‘to produce a blind assessment of the submitted responses against the model answers’.
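Since the report does not detail how that framework works internally, below is a minimal, purely illustrative sketch in Python of what a blind assessment against model answers could look like. The scoring function, the anonymisation step and every name in it are assumptions made for illustration only, not Vals' actual methodology.

```python
from difflib import SequenceMatcher
from random import shuffle

# Hypothetical sketch only: Vals' auto-evaluation framework is proprietary and its
# scoring criteria are not public. This toy version just illustrates the general
# idea of a 'blind' assessment: responses are scored against a model answer under
# anonymised IDs, and vendor names are only re-attached after scoring.

def score(response: str, model_answer: str) -> float:
    """Toy stand-in for a real evaluation criterion; returns 0.0-1.0."""
    return SequenceMatcher(None, response.lower(), model_answer.lower()).ratio()

def blind_evaluate(submissions: dict[str, str], model_answer: str) -> dict[str, float]:
    """Score each vendor's response against the model answer, blind to identity."""
    items = list(submissions.items())
    shuffle(items)  # remove any ordering clue before anonymising
    anonymised = {f"submission-{i}": vendor_and_text for i, vendor_and_text in enumerate(items)}
    scores = {anon_id: score(text, model_answer) for anon_id, (_, text) in anonymised.items()}
    # De-anonymise only after scoring, to report a percentage per vendor.
    return {anonymised[anon_id][0]: round(s * 100, 1) for anon_id, s in scores.items()}

if __name__ == "__main__":
    model = "The lease terminates on 31 December 2026 unless renewed in writing."
    responses = {
        "Tool A": "The lease ends on 31 December 2026 unless renewed in writing.",
        "Tool B": "The agreement continues indefinitely.",
    }
    print(blind_evaluate(responses, model))  # prints a per-tool percentage score
```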
Results Overview
And here is an overview of the Vals results, which AL is sharing verbatim:
‘The seven tasks evaluated in this study were Data Extraction, Document Q&A, Document Summarization, Redlining, Transcript Analysis, Chronology Generation, and EDGAR Research, representing a range of functions commonly performed by legal professionals. The evaluated tools were CoCounsel (from Thomson Reuters), Vincent AI (from vLex), Harvey Assistant (from Harvey), and Oliver (from Vecflow). Lexis+AI (from LexisNexis) was initially evaluated but withdrew from the sections studied in this report.
The percentages below represent each tool’s accuracy or performance scores based on predefined evaluation criteria for each legal task. Higher percentages indicate stronger performance relative to other AI tools and the Lawyer Baseline.
Some key takeaways include:
- Harvey opted into six out of seven tasks. They received the top scores of the participating AI tools on five tasks and placed second on one task. In four tasks, they outperformed the Lawyer Baseline.
- CoCounsel is the only other vendor whose AI tool received a top score. It consistently ranked among the top-performing tools for the four evaluated tasks, with scores ranging from 73.2% to 89.6%.
- The Lawyer Baseline outperformed the AI tools on two tasks and matched the best-performing tool on one task. In the four remaining tasks, at least one AI tool surpassed the Lawyer Baseline.
Beyond these headline findings, a more detailed analysis of each tool’s performance reveals additional insights into their relative strengths, limitations, and areas for improvement.
Harvey Assistant either matched or outperformed the Lawyer Baseline in five tasks and it outperformed the other AI tools in four tasks evaluated. Harvey Assistant also received two of the three highest scores across all tasks evaluated in the study, for Document Q&A (94.8%) and Chronology Generation (80.2%—matching the Lawyer Baseline).
Thomson Reuters submitted its CoCounsel 2.0 product in four task areas of the study. CoCounsel received high scores on all four tasks evaluated, particularly for Document Q&A (89.6%—the third highest score overall in the study), and received the top score for Document Summarization (77.2%). For the four tasks evaluated, it achieved an average score of 79.5%. CoCounsel surpassed the Lawyer Baseline in those four tasks alone by more than 10 points.
The AI tools collectively surpassed the Lawyer Baseline on four tasks related to document analysis, information retrieval, and data extraction. The AI tools matched the Lawyer Baseline on one (Chronology Generation). Interestingly, none of the AI tools beat the Lawyer Baseline on EDGAR research, potentially signaling that these challenging research tasks remain an area in which legal AI tools still fall short on meeting law firm expectations.
Redlining (79.7%) was the only other skill in which the Lawyer Baseline outperformed the AI tools. Its single highest score was for Chronology Generation (80.2%). Given the current capabilities of AI, lawyers may still be the best at handling these tasks.
Scores for the Lawyer Baseline were set reasonably high for Document Extraction (71.1%) and Document Q&A (70.1%), but some AI tools still managed to surpass them. All of the AI tools surpassed the Lawyer Baseline for Document Summarization and Transcript Analysis. Document Q&A was the highest-scoring task overall, with an average score of 80.2%. These are the tasks where legal generative AI tools show the most potential.
EDGAR Research was one of the most challenging tasks and had a Lawyer Baseline of 70.1%. In this category, Oliver was the only contender at 55.2%. Increased performance on EDGAR Research—a task that involves multiple research steps and iterative decision-making—may notably require further accuracy and reliability improvements in the nascent field of “AI agents” and “agentic workflows.” For details on AI challenges, see the EDGAR Research section.
Overall, this study’s results support the conclusion that these legal AI tools have value for lawyers and law firms, although there remains room for improvement in both how we evaluate these tools and their performance.’
Note: LexisNexis, which had engaged with the project team, ‘withdrew’. In a statement to Artificial Lawyer, the company said: ‘The timing didn’t work for LexisNexis to participate in the Vals AI legal report. We deployed a significant product upgrade to Lexis+ AI, with our Protégé personalized AI assistant, which rendered the benchmarking analysis out-of-date and inaccurate.’
–
What Now?
AL spoke to Vals team member Rayan Krishnan, and also to Tara Waters, who has been part of the project and is now working as a consultant.
Krishnan noted that it had been like a major ‘diplomatic mission’ to get everyone on board, but that he was very pleased with the end result, given that so many leading law firms had taken part. He added that the hope now is to get more legal tech vendors involved, and to focus on legal research next. There is also a plan to cover other countries – this study was for the US market, with the UK a possible next target.
The idea is to publish a major study perhaps every year on the legal AI sector. When asked whether he felt that having multiple legal AI benchmarking projects was a good thing, he said: ‘We expect individual groups to keep doing studies. There are a lot of parallels in academia.’
AL also asked if there was a risk of an ‘empirical fallacy’ – i.e. because you’re looking for X result, you find X result. Krishnan noted that the questions came from the law firms, and those tasks represented real-world needs they wanted to see addressed, so this wasn’t tilted especially in the legal tech vendors’ favour.
Waters added: ‘The questions were not designed for AI tools, the law firms chose what they were most interested in.’
More broadly, Krishnan said that this benchmarking study, and others like it, are essential as ‘among law firms there is a lot of mistrust in the market. Many are confused about hallucinations and other issues’.
He also stressed that firms are suffering from ‘pilot fatigue’ – i.e. testing many different tools in detail is exhausting for a law firm, so a benchmarking approach run by a third party like theirs is really helpful.
And what is the long-term hope here?
‘My hope is that there will be more adoption [of legal AI]. I want legal services to be more efficient,’ he said, adding that this would mean better access to justice and that disputes could be resolved more easily.
‘A benchmark is very motivating: vendors will all wonder how they stack up,’ he concluded. I.e. when performance is made public, vendors will endeavour to do better and reach higher levels of accuracy and response usefulness. That in turn drives adoption, which then leads to greater efficiency across the legal market.
Waters added: ‘We need benchmarking and transparency to build trust because law firms are unsure how to measure accuracy. [Overall] I am super-excited to be involved in this, and we need to be involved in assessing real world applications and bring together the right stakeholders.’
Note: the Vals legal AI project was also strongly supported by Nikki Shaver, Jeroen Plink and team, and engaged with John Craske at CMS, who is part of the LITIG benchmarking project.
Kudos to all involved.
—
And here is the link to the full report.