Christian Braun, an AI engineer at international law firm Osborne Clarke (OC) in Germany, has run Harvey’s recently highlighted BigLaw Bench tests for genAI accuracy and claimed that the firm’s OC-GPT (based on GPT-4o) outputs can ‘surpass’ what the legal genAI pioneer can do.
‘We decided to put it to the test using our own securely hosted Azure environment (OC-GPT based on GPT-4o),’ Braun said in a LinkedIn post yesterday.
See Harvey’s own test results here:
–
Meanwhile, this is what OC’s Braun found when he attempted to run Harvey’s own BigLaw Bench using the sample questions provided.
The OC engineer said on LinkedIn that this was: ‘Astonishing! Not only did we match Harvey’s outcomes, but in some cases, we even surpassed them.
‘This underscores a crucial point: empowering all law firm staff to harness AI effectively is key. Whether it’s through well-crafted prompts tailored for specific use cases or by teaching prompt engineering skills, the potential is immense.
‘Now, here’s a thought: It would be fascinating to see if and how much better well-crafted prompts on Harvey perform compared to our results on plain GPT4o. Has anyone explored this yet?
‘For those interested in the specifics:
– We used Harvey’s published test cases.
– Each case was prompted with one-shot and with various prompt engineering techniques (e.g. Chain-of-Thought, Few-Shots).
– Hallucinations were excluded from the evaluation.
– The evaluation was based on LLM responses to Harvey’s test case questions.
– We also considered the task weighting from Harvey’s BigLaw Bench publication.’
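For readers unfamiliar with the techniques Braun mentions, the sketch below shows what a one-shot / few-shot, chain-of-thought style prompt typically looks like. It is purely illustrative: OC has not published its actual prompts, and the example task, question and answer here are invented placeholders, not BigLaw Bench content.

```python
# Illustrative only: a generic few-shot / chain-of-thought prompt scaffold.
# OC has not published its prompts; the question, reasoning and answer
# below are invented placeholders, not BigLaw Bench content.

FEW_SHOT_EXAMPLES = [
    {
        "question": "Does clause 4.2 cap the supplier's liability?",
        "reasoning": "Clause 4.2 limits aggregate liability to fees paid in "
                     "the prior 12 months, with a carve-out for fraud.",
        "answer": "Yes, subject to the fraud carve-out in clause 4.2(b).",
    },
]

def build_prompt(task: str) -> str:
    """Assemble a prompt that shows the model worked examples (few-shot)
    and asks it to reason step by step (chain-of-thought) before answering."""
    parts = [
        "You are assisting a commercial lawyer. Think step by step, "
        "then give a concise final answer."
    ]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}"
        )
    parts.append(f"Question: {task}\nReasoning:")
    return "\n\n".join(parts)

print(build_prompt("Summarise the termination rights in the attached MSA."))
```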
—
If this is correct, why does it matter? Because the entire reason for having a product such as Harvey (or any other legal tech genAI tool) is the extensive fine-tuning, system prompting, highly developed RAG and more that goes into a legal genAI product, and perhaps also multiple calls, sometimes to multiple LLMs, to get accurate answers.
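To make that distinction concrete, here is a minimal sketch of the difference between a bare chat call and the kind of layered pipeline a legal genAI product typically adds. It is an assumption-laden illustration, not Harvey’s or OC’s actual architecture; `call_llm`, `retrieve` and `product_pipeline` are hypothetical stand-ins for a real chat-completion client, a production retrieval index and a vendor’s orchestration layer.

```python
# A minimal sketch (not Harvey's or OC's actual architecture) of why a bare
# model call and a productised legal genAI pipeline are not the same thing.

def call_llm(system: str, user: str) -> str:
    """Stand-in for a real chat-completion client (e.g. an Azure deployment)."""
    return f"[model response to: {user[:60]}...]"

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Toy keyword retrieval standing in for a production RAG index."""
    words = query.lower().split()
    return sorted(documents, key=lambda d: -sum(w in d.lower() for w in words))[:k]

def bare_call(question: str) -> str:
    # A plain GPT-4o chat: one prompt in, one answer out.
    return call_llm("You are a helpful assistant.", question)

def product_pipeline(question: str, knowledge_base: list[str]) -> str:
    # What a product typically layers on top: retrieval over firm documents,
    # a task-specific system prompt, and a second verification call.
    context = "\n".join(retrieve(question, knowledge_base))
    draft = call_llm(
        "You are a legal drafting assistant. Rely only on the provided sources.",
        f"Sources:\n{context}\n\nTask: {question}",
    )
    return call_llm(
        "You are a reviewing lawyer. Flag any statement not supported by the sources.",
        f"Sources:\n{context}\n\nDraft:\n{draft}",
    )

docs = [
    "Master services agreement: termination for convenience on 30 days' notice.",
    "Data processing addendum.",
    "Prior advice note on liability caps.",
]
print(product_pipeline("Summarise the termination rights in the MSA.", docs))
```

The point is not the specific steps, but that each layer adds engineering the base model does not have on its own, which is exactly what a like-for-like benchmark needs to account for.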
It’s not clear what the firm’s ‘own securely hosted Azure environment (OC-GPT based on GPT-4o)’ entails, or how much tooling and refinement sits behind it compared to what Harvey has developed. But either way, it’s a major claim that the law firm’s own OC-GPT ‘surpasses’ Harvey. In his comments, Braun also stresses the importance of prompting to get good results.
But are these really the same test conditions? Harvey scored its accuracy results by getting subject matter experts within its growing team to examine the responses it produced, as well as those of Claude and GPT-4o. They didn’t look just for ‘an answer’; they considered how much an answer would help a lawyer to complete a given task.
As they explained here: ‘Harvey’s research team developed bespoke rubrics to evaluate each task. These rubrics establish objective criteria that would be necessary for a model response to effectively accomplish a given task. They also penalize common LLM failure modes such as incorrect tone or length, irrelevant material, toxicity, and hallucinations. Combined, these rubrics effectively capture everything a model must do to effectively complete a task, and everything it must avoid to ensure it completes that task in a safe, effective, and trustworthy manner.
‘In order to convert these criteria into a benchmark, each affirmative requirement was assigned a positive score based on its importance to completing the relevant task. Negative criteria, such as hallucinations, were assigned negative scores as the human effort to correct errors can reduce the utility of an otherwise complete piece of AI work product. Answer score is computed by taking the combination of these positive and negative points and dividing by the total number of positive points available for a task.’
I.e., this was a very detailed assessment of the answers, with human judgement used to deduct points from each response across a range of criteria. As noted: ‘They also penalize common LLM failure modes such as incorrect tone or length, irrelevant material, toxicity, and hallucinations.’
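To make that scoring arithmetic concrete, here is a small worked sketch. The rubric items and point weights are invented for illustration; they are not real BigLaw Bench criteria.

```python
# A worked sketch of the scoring arithmetic Harvey describes: earned positive
# points plus any penalties, divided by the total positive points available.
# The rubric items and weights below are invented, not BigLaw Bench data.

positive_criteria = {                 # points available if the response meets them
    "identifies governing law": 3,
    "covers all termination triggers": 4,
    "appropriate tone and length": 1,
}
negative_criteria = {                 # penalties for common failure modes
    "hallucinated citation": -3,
    "irrelevant material": -1,
}

# Suppose a response meets the first two positive criteria but hallucinates once.
earned = (positive_criteria["identifies governing law"]
          + positive_criteria["covers all termination triggers"])   # 7
penalty = negative_criteria["hallucinated citation"]                # -3

total_available = sum(positive_criteria.values())                   # 8
answer_score = (earned + penalty) / total_available                 # (7 - 3) / 8 = 0.5
print(f"answer score: {answer_score:.2f}")
```

On that arithmetic, a response that looks complete can still score poorly once penalties are applied, which is why excluding hallucinations from an evaluation can change the numbers materially.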
While OC says that hallucinations were excluded from its evaluation, it does not explain whether or how it applied all of the other, very subtle, criteria above for judging a response, and this is likely to be critical to the assessment of what comes out of BigLaw Bench.
As noted when Harvey made the BigLaw Bench evaluation partially public, its approach is actually surprisingly hard on its own outputs, but that is because the responses are put under lawyer-level scrutiny, rather than just being checked for the ability to produce any response that seems to be OK.
Moreover, it’s not clear how many test cases Braun ran. It looks like Harvey only made six examples public, out of the dozens it used for its own assessment.
–
All of this only underlines the need for common standards for the evaluation of legal genAI. If two organisations can run the same test (or believe they have done so) and get very different results, then we have a lot of work to do.
Of course, the simple answer may be that OC didn’t actually run the tests as Harvey intended them to be run, and that’s why the scores seem to be so different.
But, again, it underlines why we need a shared approach. The genAI standards project still has much work to do.