Harvey Launches Legal GenAI Evaluation System: BigLaw Bench

Harvey has publicly launched BigLaw Bench, their own methodology for evaluating the accuracy of genAI tools on legal tasks. For answers, the benchmark measures: ‘what % of a lawyer-quality work product does the model complete for the user?’

It’s a bold step, as they make public not only their methodology (see more below), but also their own scores. When measured against general LLMs, such as GPT-4o, Harvey scored 74% overall for answers, across transactional and litigation tasks.

Harvey data.

In comparison, GPT-4o – the latest version of OpenAI’s general LLM – scored 61% for answers on the same legal tasks.

The company has also set out a methodology for what they term a ‘Source Score’, i.e. the ability to provide verifiable answers with correct sources – in effect the model’s RAG capability. Here, Harvey got an overall score of 68%, while the general LLMs did very badly at sourcing their answers correctly: GPT-4o got just 24%, and Claude 3.5 a very low 8%.

Here is how they formulated the evaluation:

‘Each task in BigLaw Bench is assessed using custom-designed rubrics that measure:

Answer Quality: Evaluates the completeness, accuracy, and appropriateness of the model’s response based on specific criteria essential for effective task completion.

Source Reliability: Assesses the model’s ability to provide verifiable and correctly cited sources for its assertions, enhancing trust and facilitating validation.

Scores are calculated by combining positive points for meeting task requirements and negative points for errors or missteps (e.g. hallucinations).’

Those scores are then expressed as percentages, as seen above.

And for the questions that were put to the models and the deeper details of how the evaluation worked, see their blog and GitHub page (links below).

But, overall, they looked at legal work, broke it down into transactions and litigation, then broke those two pillars into several core tasks related to each. And this makes a lot of sense. In fact, during the recent discussions in London to help develop common standards and benchmarks for legal genAI tools, this site has underlined the fact that there cannot be just one approach for all legal tasks: legal research is very different to considering the impact of a new piece of legislation, which in turn is very different to re-drafting a clause, and so on.

That said, it looks like Harvey has done a thorough job here of breaking legal work down into modal groupings of tasks – i.e. the tasks that occur most often – in order to create some order out of the mass of different things a lawyer does.

Here is how they put it:

‘Harvey’s research team developed bespoke rubrics to evaluate each task. These rubrics establish objective criteria that would be necessary for a model response to effectively accomplish a given task. They also penalize common LLM failure modes such as incorrect tone or length, irrelevant material, toxicity, and hallucinations. Combined, these rubrics effectively capture everything a model must do to effectively complete a task, and everything it must avoid to ensure it completes that task in a safe, effective, and trustworthy manner.

In order to convert these criteria into a benchmark, each affirmative requirement was assigned a positive score based on its importance to completing the relevant task. Negative criteria, such as hallucinations, were assigned negative scores as the human effort to correct errors can reduce the utility of an otherwise complete piece of AI work product. Answer score is computed by taking the combination of these positive and negative points and dividing by the total number of positive points available for a task.’
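To make that arithmetic concrete, below is a minimal sketch of how the described answer-score formula could be computed – positive points earned, minus penalty points, divided by the total positive points available. The rubric items, point values and function name here are hypothetical, for illustration only, and are not Harvey’s actual rubrics or weightings.

```python
# A minimal sketch of the answer-score formula as described above:
# (positive points earned - penalty points incurred) / total positive points available.
# All rubric items and point values below are hypothetical, for illustration only.

def answer_score(earned_positive, penalties, total_positive_available):
    """Return the answer score as a fraction (multiply by 100 for a percentage)."""
    return (sum(earned_positive) - sum(penalties)) / total_positive_available

# Hypothetical task rubric: 10 positive points available in total.
earned = [3, 2, 2]       # criteria the model response satisfied
penalties = [1.5]        # e.g. a deduction for a hallucinated citation
total_available = 10

print(f"Answer score: {answer_score(earned, penalties, total_available):.0%}")  # -> 55%
```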

And below are the tasks they set out.

Harvey data.

The company has also stated that it will continue to develop and refine BigLaw Bench, and will work with LLM evaluation company Vals – which, by chance, Artificial Lawyer spoke to very recently.

What Does This All Mean?

Well, first, it looks like Harvey has gone ahead and done a lot of the work that the legal genAI accuracy group – set up in London by John Craske at CMS, with support from LITIG – had intended to do. Clearly it makes sense that everyone involved – including this site – takes into consideration what Harvey is proposing in terms of methodology before trying to re-invent the wheel.

The task categories, and the ultimate scoring system, make a lot of sense. Perhaps the next task of the legal genAI group in London should be to look in depth at BigLaw Bench and see how far it achieves the desired goals and where it may need to be extended.

Second, it’s great to see what is now a well-known legal AI company being transparent like this. The scores it gave, e.g. 74% for answers, are perhaps lower than some might expect. However, there is a difference between these multi-factor evaluations and a direct accuracy score that simply measures whether an LLM finds a piece of information. This approach takes in multiple other criteria and measures how much of a task the model completes to the expectations of an experienced lawyer – which is a very high bar to test against.

While some might still say ‘That’s not a very high score’, when you step back and consider what it means, the reality is in fact impressive: this software can produce legal outputs that are the equivalent – right now – of around 74% of what an experienced lawyer could do on a task. Plus, this will only get better.

When that score reaches something like 90% – i.e. instantly providing the equivalent output of an experienced lawyer – then the market really will have to change. At present, Harvey and other genAI tools are clearly assistants. But in five years… in ten years from now?

Overall, it’s great to see Harvey going public like this. And this builds on the work of companies such as Paxton and Screens, which went public with their accuracy scores and published their methodologies over the summer.

All of this ultimately helps to usher in an era of greater transparency for legal AI, and that’s a very good thing for everyone. Check out BigLaw Bench and see what you think.

Link to blog

Link to GitHub page

P.S. the company acknowledged the work of many team members in this project: Julio Pereyra, Elizabeth Lebens, Matthew Guillod, Laura Toulme, Cameron MacGregor, David Murdter, Karl de la Roche, Emilie McConnachie, Jeremy Pushkin, Rina Kim, Aaron Chan, Jenny Pan, Boling Yang, Nan Wu, Niko Grupen, Lauren Oh, Aatish Nayak, Gabriel Pereyra.

Legal Innovators UK Conference

If this subject is of interest, then come along to the Legal Innovators UK conference in London, Nov 6 and 7, where Harvey will be taking part.

For more information please see here.

And to get your tickets now, please see here.

See you there!