AgentEval Launches Open-Source AI Benchmarking Initiative

AgentEval is a new initiative to provide the legal market with an open-source collection of genAI benchmarks that anyone can use freely, while also building a community to share data, ideas and protocols for evaluating legal AI tools.

Heading the project is Darius Emrani, CEO of Scorecard, a startup focused on supporting LLM-based product development. He told Artificial Lawyer: ‘The idea is to provide a lot of benchmarks and best practices. We want to get people involved and become the trusted source for AI benchmarks.’

This site asked why he’s chosen to focus on legal tech. Emrani said the idea is to help ‘essential services’, a category that also includes health and finance, with their AI needs. Central to that is assessing accuracy – or developing best practices around accuracy – and so that’s where AgentEval comes in.

He added that the agentic aspect of AI is also a key factor in how these sectors understand and measure accuracy.

Emrani also stressed that the goal is to remain open-source and community-driven, and that he would be keen to engage with LITIG and other projects around the world focused on legal AI benchmarking.

The organisation stated: ‘Because some benchmarking efforts often rely on proprietary datasets, closed methodologies, and restricted access, it can be difficult for researchers and developers to reproduce results, compare models fairly, and refine systems.

‘At the same time, we’ve seen successful open evaluation frameworks — from NIST and ISO standards to initiatives like MLCommons, LMSYS Chatbot Arena and LegalBench — showing that collaborative, open-source approaches lead to better benchmarking practices.’

They went on to say that this helps:

  • ‘Law firms — Gain a clear, standardized way to compare legal AI solutions and select the best tools for their needs.
  • Legal AI vendors — Understand their performance relative to competitors and improve their models based on objective, industry-standard benchmarks.
  • Academics & Policymakers — Access insights into how AI systems perform in real-world legal applications, ensuring responsible deployment and regulation.
  • The broader AI industry — By making benchmarks and best practices freely available, AgentEval gives smaller startups, research institutions, and independent developers access to the same high-quality evaluation resources as well-funded private companies.’

And here’s a summary of their goals:

Why we built AgentEval.org

Mission

To establish a trusted, open-source resource for sharing AI benchmarks and best practices that drive transparency and continuous improvement in AI evaluation.

Vision

A future where open collaboration and shared data standards accelerate responsible AI innovation, making evaluation methodologies accessible and verifiable for everyone.

Why Open Benchmarks?

Transparency & Trust

Open-sourcing our benchmarks and methodologies allows anyone to inspect, validate, and contribute to our evaluation processes.

Community-Driven Innovation

An open platform invites contributions from a broad community, leading to more robust and diverse evaluation practices.

Industry Adoption

Open-source tools and standards are more likely to be adopted by academic institutions, industry players, and public agencies.

Non-Profit and Collaborative Alignment

Emphasizing open source aligns perfectly with our mission to move the industry forward through shared knowledge and collective effort.

You can find more info at AgentEval.org.

Why This Matters – The AL View

The need for some clarity on genAI accuracy came to everyone’s attention last summer after the ‘Stanford Debacle’, when a group of researchers claimed to have exposed major failings in well-known genAI legal research tools.

Since then things have evolved. Artificial Lawyer, and others, raised the idea of setting up some sort of shared approach to genAI accuracy. AL suggested this could range from a ‘Kite mark’ to a set of protocols to help buyers and sellers approach this issue.

LITIG and others have set things in motion, and private companies are getting involved as well. But the challenge remains: what are we trying to achieve here? What does ‘there’ look like when it comes to gauging AI accuracy? Is it just a single benchmark test, or several tests, owned by one entity? Is it a basket of different benchmarks? Is it open-source and free for all to use?

Moreover, as AL has suggested, do we need to think more in terms of a ‘compass’ approach? That is, because foundation models are moving so fast, and because any company scoring X on a specific benchmark made by Y entity will immediately try to refine its outputs to improve that score – as happened after the Stanford study – is there any lasting value in relying on a single-test approach to benchmarking?

Indeed, one of the most insightful points to come out of the Stanford fallout was made by Jeff Pfeifer at LexisNexis, about the need to focus on ‘answer usefulness’. That is, to some extent the value of any AI output is in the ‘eye of the beholder’.

So, perhaps what we need instead is a general direction to head in, and an idea of what ‘good’ looks like based on a basket of different benchmarks. That is, for X type of task we should expect Y level of results, with those outputs set in the context of the task: case law research is different to summarising testimony, and both are very different to red-lining a basic NDA. But the overall approach is holistic, open-source, and community-driven.
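To make that ‘basket’ idea a little more concrete, here is a minimal sketch in Python of how a set of task-specific benchmarks and expectations might be represented. This is purely illustrative: the task names come from the examples above, but the benchmark names, thresholds and scores are invented and are not part of AgentEval’s actual methodology.

```python
# Hypothetical sketch of a 'basket of benchmarks': each task type carries
# its own benchmarks and its own illustrative threshold for 'good'.
# None of the names or numbers below come from AgentEval itself.
from dataclasses import dataclass, field


@dataclass
class TaskBenchmark:
    task: str              # e.g. "case law research"
    benchmarks: list[str]  # named open benchmarks used for this task
    expected_score: float  # illustrative threshold for 'good' (0.0 - 1.0)


@dataclass
class EvaluationBasket:
    tasks: list[TaskBenchmark] = field(default_factory=list)

    def meets_expectations(self, results: dict[str, float]) -> dict[str, bool]:
        """Compare a tool's per-task scores against the basket's expectations."""
        return {
            t.task: results.get(t.task, 0.0) >= t.expected_score
            for t in self.tasks
        }


# Illustrative usage with made-up benchmark names and thresholds.
basket = EvaluationBasket([
    TaskBenchmark("case law research", ["legal-retrieval-benchmark"], 0.85),
    TaskBenchmark("summarising testimony", ["summary-faithfulness"], 0.80),
    TaskBenchmark("red-lining a basic NDA", ["clause-edit-accuracy"], 0.90),
])
print(basket.meets_expectations({
    "case law research": 0.88,
    "summarising testimony": 0.75,
}))
```

The point of the sketch is simply that ‘good’ is judged per task type against a basket of measures, rather than by a single universal score.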

Does that mean we don’t need benchmarks? No. Not at all. Quite the opposite. Rather, the point is that legal work is subjective because it’s all based on language, reasoning and interpretation. Measuring AI accuracy is not like measuring the speed of a car going down a road. So, we need multiple pathways and viewpoints to assess accuracy, and a more principle-based approach that accommodates this is more flexible and can evolve as the sector evolves.

To conclude, AL welcomes this open-source, community-based approach, which seeks to build a general consensus on what ‘good’ looks like, along with protocols for how to approach genAI tools. Within that ecosystem of evaluation are benchmarks – but they are part of a multi-pronged approach to assessment – and it’s accepted, in fact expected, that the results for each company will evolve as rapidly as the state of the art does. Hence, we return to the compass approach.

Any road, that’s AL’s view. There is much more to come in terms of genAI accuracy benchmarking. Watch this space!

Richard Tromans, Founder, Artificial Lawyer