Last week, the first meeting of the genAI Benchmarks Initiative took place at the London office of CMS. Its goal: to scope out a path for standardising accuracy assessments for genAI legal tools. Here are some of the key points.
Overview
The meeting brought together around 40 people, selected from a larger group who had expressed an interest in the project. They represented law firms, inhouse teams, legal tech companies and other parts of the legal sector.
The initiative was triggered by the Stanford legal research tool study debacle, which blew a hole through the idea that all genAI tools are hitting the levels of accuracy (and lack of hallucinations) most lawyers would expect.
After calls from Artificial Lawyer and others for an industry consensus on genAI accuracy in the wake of the Stanford study’s fallout, John Craske, Head of Innovation at CMS, has put together this project with the backing of legal tech group LITIG.
This initial meeting was a way of ‘testing the water’ and figuring out how much support there was for real change, i.e. setting standard measures and benchmarks for accuracy. In a scene-setting introduction, this site’s founder, Richard Tromans, provided the background to the challenges we face as a sector, namely that lawyers are expected to wield a range of genAI tools, yet there is no shared approach to measuring accuracy, or to what the results should look like.
As the introduction pointed out, not long after ChatGPT arrived in November 2022, it became clear that genAI had issues with accuracy and hallucinations. However, the potential productivity gains of genAI, and its broad usability, have been so great that demand to use it has only grown – and so too has the number of genAI-based startups. Yet the accuracy issue has remained all the same.
In short, genAI tools are only going to be used more and more – yet the legal sector has no standards, no benchmarks and no shared approach for measuring what these tools do. This puts all the risk on the buyer, along with the cost of running POCs and tests, not just to see how lawyers use the tools, but simply to figure out whether they work well or not. This seems to be an unsustainable situation.
Overall there was strong support for change from the room.
Benchmark Options
As an earlier Artificial Lawyer piece set out, there are perhaps four levels of engagement here in terms of what we can achieve, ranging from a loose agreement to share information between firms and inhouse teams, to creating a formalised approach that is shared across the sector, to, at the highest level, developing a kitemark (i.e. a quality badge that is maintained, perhaps by a new body). See the earlier thoughts on this in the notes below.
Craske and colleagues have now boiled the four steps down to: ‘(1) a transparency commitment; (2) agreed methodologies; (3) defined use cases; (4) kitemark / third party verification.’
When asked which level people preferred to aim for, most said 2 or 3. This is because moving from zero to a kitemark would be very hard to achieve. But moving to a shared commitment and/or shared methodology of some type does seem possible, and from there we can perhaps, at some point in the future, reach level 4.
The central challenge is actually to do something, and there was clearly support for a concrete outcome. So this is really positive.
As Craske concluded: ‘In terms of next steps, we are now digesting all of the notes, feedback and flip chart scribbles and we will pull those into something we can all use. As promised, we are also planning to host a virtual session for those people who couldn’t make it to London – as soon as we have a date pencilled in, we’ll let you all know. At the virtual session I would like to playback the output from the in-person workshop, sense-check what we found and build on the work we did in terms of a potential consultation paper.
‘Once we have held the virtual session, we will be forming a working group (perhaps ‘core’ and ‘halo’ groups) to work on drafting a consultation paper.’
Key Challenges to Address
While there was broad support for action, there are plenty of challenges that need attention. Here are some of the topics explored at the meeting:
- Getting the right balance when it comes to use case accuracy – There are clearly dozens of use cases for genAI, and each one may have different accuracy needs (e.g. case law research for a specific citation is not the same as looking for ways to redraft a clause, which may have multiple solutions). We cannot have a system with dozens of different standards; that would be unworkable. So, what is the right balance? Several people suggested that we may need to group use cases into a handful of core examples, which can then be used as a guide. If we had, say, five main types of use case, the standard measures needed for them, and then the accuracy benchmarks to fit, that would give us a realistic foundation (see the illustrative sketch after this list).
- Who should be doing the accuracy measuring? – Some suggested that the vendors clearly had to be responsible for doing the testing and showing the results, relative to whatever standards are developed. And it seems that, whatever the outcome, the vendors will have to play ball here, otherwise buyers will still have to carry all the risk and the testing burden – unless a third-party testing body is set up. However, one view was that smaller startups would not be able to handle the tests. But this site would point out that it’s actually been the smaller companies, such as Screens and Paxton AI, that have run accuracy tests and published the results, while it’s the larger ones that have avoided doing so. Plus, all vendors will be conducting internal tests – or should be – so publishing them is not an onerous task; it’s rather that some just don’t want to … yet. (P.S. it was also discussed whether we need a shared protocol for buying legal genAI tools, to ensure that vendors meet high standards of marketing transparency when it comes to sharing accuracy information, and that buyers feel empowered to ask about such things in a structured way.)
- What would this eventually look like? – Once we have agreed on standard measures and benchmarks for several core use cases, what then? How does this get presented? That’s a good question and one we’ll need to work on. One suggestion was a free, public directory that ranked all of the genAI tools, showing their accuracy (in reference to whatever standards are agreed) across the various core use cases. That sounds like a good idea, although it raises the challenge of who or what would manage it. But these are early days, and this may well be achievable.
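To make the ‘grouped use cases plus benchmarks plus public directory’ idea concrete, here is a minimal sketch in Python of how such data might be structured. Everything in it is an assumption for illustration only: the use case names, measures, thresholds, vendor and tool names are invented, and none of them reflects anything agreed at the meeting.

```python
from dataclasses import dataclass

# Hypothetical core use-case groupings and accuracy measures.
# The real categories, metrics and thresholds would come out of the consultation.
@dataclass
class UseCaseBenchmark:
    name: str                # e.g. "case law research"
    measures: list[str]      # e.g. ["citation accuracy", "hallucination rate"]
    minimum_accuracy: float  # agreed benchmark threshold, 0.0 to 1.0

@dataclass
class ToolResult:
    vendor: str
    tool: str
    scores: dict[str, float]  # use-case name -> measured accuracy

# Illustrative set of five core use cases, as floated at the meeting (names and numbers invented).
CORE_BENCHMARKS = [
    UseCaseBenchmark("case law research", ["citation accuracy", "hallucination rate"], 0.95),
    UseCaseBenchmark("contract review", ["clause identification accuracy"], 0.90),
    UseCaseBenchmark("clause redrafting", ["reviewer acceptance rate"], 0.80),
    UseCaseBenchmark("document summarisation", ["factual consistency"], 0.90),
    UseCaseBenchmark("due diligence extraction", ["field extraction accuracy"], 0.90),
]

def directory_entry(result: ToolResult) -> dict:
    """Build one row of a public directory: score and pass/fail per core use case."""
    row = {"vendor": result.vendor, "tool": result.tool}
    for bench in CORE_BENCHMARKS:
        score = result.scores.get(bench.name)
        if score is None:
            row[bench.name] = "not tested"
        else:
            status = "meets benchmark" if score >= bench.minimum_accuracy else "below benchmark"
            row[bench.name] = f"{score:.0%} ({status})"
    return row

if __name__ == "__main__":
    # Invented example vendor and scores, purely to show the directory row format.
    example = ToolResult(
        vendor="ExampleCo",
        tool="ExampleDraft",
        scores={"case law research": 0.97, "contract review": 0.88},
    )
    print(directory_entry(example))
```

The point of the sketch is simply that a small, fixed set of core use cases, each with named measures and an agreed threshold, is enough to generate a comparable, readable directory row for any tool, whatever form the real standards eventually take.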
Conclusion
Overall, the energy and positivity in the room were excellent. There was a genuine feeling among the attendees that something should be done – and most of all, that something could be done.
Let’s give the last word to John Craske, who has made this a reality: ‘It was great to host 40 people from across the industry to discuss what we can do to improve transparency and trust in legal AI and ultimately drive responsible adoption. Thanks to everyone who came along and to Richard for helping set the scene. Everyone there agreed this was something where we need to come together across the legal industry and do something practical that benefits everyone.
‘Next steps will be to hold a virtual session for those people who couldn’t make it to London and then we’ll be looking to put together a working group to draft a consultation paper.’
—
Notes on the 4 possible steps. Here is something AL did earlier in July, which feeds into the current pyramid of goals:
‘Here are four scenarios:
Level 1 – There is a very constructive conversation. Ideas and information are shared, parties agree to be transparent on the tests they conduct to help others develop their own benchmarks, and everyone comes away better for it. Some vendors also promise to be more transparent about their genAI tool performance. There is no formal outcome, in part because these are all individual businesses and getting a large-scale agreement is not easy, but people will keep in contact, and knowledge levels will be higher. Overall, a helpful improvement.
Level 2 – A formal agreement is made and all the various parties agree to 1) a set of benchmarks and the supporting standards for them, 2) to share information based on tests and use of tools they have at their firm or inhouse, and 3) the vendors also buy into this and also commit to transparency, as this may well boost their sales. But, although an organised network (see below) may have helped to convene things, there is no ‘regulator’ or ‘standards body’ standing behind this, as such. It’s an ‘alliance of the willing’, kept going by individuals who wish to sustain it.
Level 3 – A body of some type either takes on this need as part of its formal role, or such a formal body is created, to take on the responsibility to sustain these standards and benchmarks, to drive forward transparency and communicate new developments, all for the good of the industry. They could even act as a testing centre, examining products and then publishing their results. Many such bodies exist across the economy in various areas, but they need some level of support from market participants to function.
Level 4 – One other outcome, which is probably the least likely here, is that a truly formal body is created that becomes more like a regulator, or is in fact actually a regulator, which is not just a promoter of benchmarks and standards, but actively enforces them, with penalties for those vendors who do not adhere to the rules, which perhaps may simply be the withdrawal of a ‘kite mark’ or ‘quality standards badge’ the body can award or take back each year.’
See more here.
Thanks to everyone who has taken part and shown an interest.