After in-person and online meetings to develop thinking around how to approach genAI accuracy, the LITIG AI Benchmarking project is now moving to the next stage. (Plus, below are some additional thoughts from this site on where to head next, see: The Need For A ‘GenAI Compass’ Approach.)
John Craske, Head of Innovation at CMS, who has spearheaded the LITIG project, sets out below some key points about the next steps:
‘We have now formed a working group to take these outputs [from the meetings around issues related to benchmarking genAI accuracy] to the next level and to draft a consultation paper.
This consultation paper will then be shared with [members of the project] for feedback, before sharing with the industry more widely. Our goal is to have the first outputs in circulation by the end of December – likely around transparency commitment [which relates to the goal of having a clear, shared approach to the sale and deployment of genAI tools].
The working group includes members from law firms of different sizes, as well as large and small legal tech vendors. We have also now created a LinkedIn group, the Litig AI Benchmark group, to allow for easier sharing and discussion.
Finally, we are also continuing to explore opportunities to align and collaborate with others to spread the load and avoid any duplication – nobody needs multiple legal AI benchmarks!
These include:
- Michael Kennedy and Addleshaw Goddard’s experiences of seeking to improve accuracy of genAI outputs, such as through refinement of prompts.
- Neel Guha at Stanford about LegalBench
- Megan Ma, Associate Director at CodeX at Stanford Law School
- Sarah Chambers, Head of Strategy and Engagement at Ashurst about Vox PopulAI
- Rayan Krishnan and Tara Waters at ValsAI about a ‘Vals Legal AI Report’. Note: As seen in recent announcements, ValsAI are progressing well with their benchmarking study in the US. We are investigating ways we could work together and support each other to achieve our aligned goals.’
—
The Need For A ‘GenAI Compass’ Approach
So, there you go. Clearly this is going to involve a lot of work and the shared engagement of plenty of stakeholders. But, progress is being made and it’s great to see so many different parties getting involved.
Overall, it’s this site’s view that while it may be impossible to define exact accuracy levels for each specific use case on an industry-wide basis – given how subjective each task can be for each individual user, even down to how they are prompting – developing a shared awareness and understanding of what can be generally achieved in a number of core use cases makes sense. That then gives buyers of genAI tools something to shape their assessment of the outputs of this technology and avoids unnecessary misunderstandings around what is possible.
I.e. having a fixed accuracy benchmark for a broad use case that must then be applied to every particular tool that provides results in that area is tricky to do. But, we can still improve genAI transparency by:
- Defining multiple core legal use cases and the key outputs we are expecting from the genAI in each one. E.g. a summarisation output works with a different type of accuracy expectation from a single case law research query, and both of those are different again from using genAI to leverage a playbook to modify a clause during contract review. I.e. there is no single ‘genAI accuracy’, but rather many different ones. (And there may well be very different approaches within the application layer of any tool to achieve those results as well.)
- Once we have defined the use cases and what the output goals are, we can then estimate a general expectation of accuracy for those specific needs. But, as noted, these expectations will need to be flexible and broad, and also take into account that the ‘state of the art’ is evolving month by month. E.g. new LLMs with better reasoning capabilities are clearly a work in progress, while better system-prompting and RAG development is also changing accuracy across multiple use cases. (The work of Addleshaw Goddard on pushing up genAI accuracy around due diligence is a good example of how things can be shifted in a short time. Meanwhile, we should expect foundational LLMs to keep improving over the medium-to-long-term.) A rough, purely illustrative sketch of how such use-case-level expectations could be recorded follows after this list.
- So, from this site’s point of view, even if an evaluation system can say today that X tool has Y accuracy at Z task, it’s at best an approximation, as each user’s particular application of the genAI to a piece of legal work will vary, and the underlying tech of each tool will also change over time. In short, that is like creating a map where the landmarks are constantly shifting and thus cannot be relied upon for long.
- To create something that can last, any shared benchmark will likely be ‘direction-based’ rather than an absolute picture of a fixed end result. Alongside this, there will need to be a shared approach, in effect a protocol, that raises awareness of how to consider genAI outputs across multiple use cases and the general accuracy expectations for each.
- In short, we need to develop a compass that guides us over the genAI terrain as this technology and its applications evolve.
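To make the ‘many different accuracies’ idea concrete, here is a minimal, purely illustrative sketch of how use-case-level expectations might be recorded. This is not LITIG’s design: the use cases, field names and accuracy bands are assumptions made up for illustration, and any such bands would need to be revisited as models, prompting and RAG techniques improve.

```python
# Illustrative only: a toy representation of a use-case based 'compass'.
# Every name and number below is an assumption for demonstration, not
# real benchmark data or the LITIG working group's actual schema.
from dataclasses import dataclass


@dataclass
class UseCaseExpectation:
    use_case: str          # e.g. "summarisation"
    expected_output: str   # what the genAI is expected to produce
    accuracy_band: tuple   # broad (low, high) expectation, 0.0-1.0
    as_of: str             # bands shift as the state of the art moves


# Hypothetical entries: the point is that each use case carries its own
# expectation, not that these particular figures are correct.
COMPASS = [
    UseCaseExpectation("summarisation", "faithful summary of a document", (0.80, 0.95), "2024-11"),
    UseCaseExpectation("case law research", "correct, properly cited authorities", (0.60, 0.85), "2024-11"),
    UseCaseExpectation("playbook-driven clause review", "clause amended per playbook rules", (0.70, 0.90), "2024-11"),
]

for entry in COMPASS:
    low, high = entry.accuracy_band
    print(f"{entry.use_case}: expect roughly {low:.0%}-{high:.0%} (as of {entry.as_of})")
```

The design point is simply that each use case gets its own expected output and a broad, dated band, rather than one fixed score applied to every tool and task.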
Any road, as noted, we are making progress, just as the genAI developers are making progress as well. Kudos to all involved at LITIG and at the many other projects around the world.
Richard Tromans, Founder, Artificial Lawyer