How Would You Define ‘Accuracy’ When It Comes to AI?

The LITIG-backed AI Benchmarking project is looking for budding Dr Johnsons to provide feedback on a glossary of terms related to AI accuracy. Understandably, they want some agreed common ground on definitions before setting out their stall, so any thoughts are welcome.

One of the big challenges in developing an approach to AI accuracy in legal work is how broadly the term can be interpreted: there are multiple use cases for AI, and each may have different criteria for expected outcomes.

Dr Johnson, compiler of the celebrated 1755 English dictionary, hard at work on AI definitions.

But that’s just the start. Agentic AI is developing so rapidly that it’s hard to find a single definition people can agree on. In fact, even the idea of benchmarking itself is open to debate, e.g. should the benchmark be based on ‘answer usefulness’ to a lawyer on each matter? Or should it be based on a widely shared standard test, e.g. a Bar exam?

See definition examples below taken from the LITIG Benchmarking Draft Glossary:

Accuracy: A measure that describes the proportion of correct and relevant results among the total number of tests conducted. In some cases, a response may be partially accurate and any accuracy measures should make it clear how accuracy is determined. AI systems may also have different measures of Accuracy for different Use Cases.

Agentic AI: A type of AI system that can make decisions, initiate workflows, and otherwise act autonomously within defined parameters, with limited human supervision.

AI Benchmarking: The process of evaluating and comparing the performance of different AI systems or models to identify which performs best for specific tasks.
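To make the draft’s Accuracy definition a little more concrete, here is a rough, illustrative sketch of how such a measure might be computed, assuming each test response is graded as fully correct (1.0), partially accurate (e.g. 0.5) or wrong (0.0), and that results are grouped by use case. The grading scale, the function name and the example use cases are assumptions made here for illustration; they are not part of the LITIG glossary.

# Illustrative sketch only: scoring the draft Accuracy definition, assuming
# each test result has already been graded on a 0.0-1.0 scale (1.0 = fully
# correct and relevant, 0.5 = partially accurate, 0.0 = wrong). The grading
# scale and the per-use-case grouping are assumptions, not LITIG's method.

from collections import defaultdict

def accuracy_by_use_case(results):
    """results: iterable of (use_case, score) pairs, with score in [0.0, 1.0]."""
    totals = defaultdict(lambda: [0.0, 0])  # use_case -> [sum of scores, number of tests]
    for use_case, score in results:
        totals[use_case][0] += score
        totals[use_case][1] += 1
    # Accuracy per use case = sum of graded scores / total number of tests conducted
    return {use_case: scored / count for use_case, (scored, count) in totals.items()}

# Hypothetical example: one tool tested on two different legal use cases.
tests = [
    ("clause extraction", 1.0),   # fully correct and relevant
    ("clause extraction", 0.5),   # partially accurate response
    ("case summarisation", 1.0),
    ("case summarisation", 0.0),  # incorrect or irrelevant
]
print(accuracy_by_use_case(tests))
# {'clause extraction': 0.75, 'case summarisation': 0.5}

The point of grouping by use case is the one the draft glossary itself makes: a single headline accuracy number can hide very different performance across tasks.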

Meanwhile, John Craske, Head of Innovation at CMS, who has been spearheading the project, told Artificial Lawyer that the LITIG initiative is making good progress and there have been several meetings of the working group to develop the ideas set out in the earlier workshops.

He added: ‘We welcome thoughts and feedback on the Glossary and these definitions, especially when considering how you and those around you talk about and adopt AI. We will collate all the feedback and review it over the next month or so to inform the next version. 

Note: if not already part of the main project, you can send in your thoughts via the LITIG Benchmarking group’s LinkedIn page here – where you can also see posts that include the entire set of terms.

One of our next steps is to create an online space where we can publish this glossary (probably in a Wiki / GitHub type format so it is more useful and easier to maintain) and store other information related to this initiative.

The working group are now turning their attention to defining principles for transparency of AI tools. That will include a high-level transparency charter and a more detailed template statement or ‘model card’, which will set out the information expected from tool vendors.’
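For readers unfamiliar with the term, a ‘model card’ is typically a short, structured disclosure document describing an AI model or tool. The working group’s template has not been published, so the sketch below is only a guess at the kind of fields a vendor transparency statement might cover; every field name and value here is a hypothetical assumption.

# Hypothetical sketch of the sort of fields a vendor transparency statement or
# 'model card' might include. These field names are illustrative assumptions,
# not the LITIG working group's actual template.

example_model_card = {
    "tool_name": "ExampleLegalAI",                     # hypothetical vendor tool
    "underlying_models": ["<foundation model name>"],  # which base model(s) are used
    "intended_use_cases": ["contract review", "legal research"],
    "training_and_data_summary": "high-level description of data sources and fine-tuning",
    "evaluation": {
        "benchmark_used": "internal test set or a shared standard test",
        "accuracy_by_use_case": {"contract review": 0.87},  # made-up figure
    },
    "known_limitations": "e.g. risk of hallucinated citations",
    "human_oversight": "recommended review steps for supervising lawyers",
}

A published template would presumably standardise both the field names and the evaluation measures, so that tools could be compared like for like.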

So, there you go. Progress. Meanwhile, the Vals benchmarking project in the US is also progressing.