Paxton Hits 94% Accuracy On Stanford GenAI Benchmark

Legal genAI company Paxton AI has announced it has achieved ‘93.82% average accuracy on tasks in the Stanford Legal Hallucination Benchmark’. The company has also launched its own ‘AI Confidence Indicator’ to help lawyers make an informed judgment about the tool’s responses (see more below).

This comes on the heels of its recent Paxton AI Citator release, which achieved a 94% accuracy rate on the separate Stanford CaseHOLD benchmark.

Recently, contract platform Screens also announced 97% accuracy for its genAI doc review capability, providing detailed information on how it achieved this – see here.

These and other moves have been triggered by the fallout from the Stanford University genAI study into legal research tools, which found that – based on its approach of using a series of test questions – the tools were far less accurate and more prone to hallucinations than had been expected.

At the same time, this publication and others have called for an industry consensus on standard measures and benchmarks for accuracy. To that end, UK-based legal tech group LITIG, with leadership from John Craske, Head of Innovation at CMS, is spearheading a project backed by this site to see what can be done. In fact, there will be an initial meeting in London this Thursday to explore the challenges of developing standards and benchmarks for the use of genAI for legal work.

But, back to Paxton.

We don’t have space to reproduce the entire test and its related information – you can see it here. But, here are some key comments from the US-based company, which provides a range of solutions including for doc review and drafting.

The following is from their mini White Paper:

‘The Stanford Legal Hallucination Benchmark evaluates the accuracy of legal AI tools, measuring their ability to produce correct legal interpretations without errors or ‘hallucinations.’ High performance on this benchmark indicates a system’s robustness and reliability, making it a critical measure for legal AI applications.

Why This Benchmark Is Important

Legal AI tools, like those developed at Paxton AI, are increasingly relied on in professional settings where accuracy can significantly impact legal outcomes. The benchmark measures various tasks such as case existence verification, citation retrieval, and identifying the authors of majority opinions. High performance in these areas signals that AI can be a trustworthy aid in complex legal analyses, potentially transforming legal research methodologies.

Details of the Stanford Legal Hallucination Benchmark

– Scope: The benchmark encompasses approximately 750,000 tasks that span a variety of legal questions and scenarios. These tasks are derived from real-world legal questions that challenge the AI’s understanding of law, its nuances, and its ability to apply legal reasoning.

– Task Variety: The tasks within the benchmark are designed to cover a wide range of legal aspects, including statutory interpretation, case law analysis, and procedural questions. This variety ensures a thorough evaluation of the AI’s legal acumen across different domains.

– Complexity: The tasks vary in complexity from straightforward fact verification, such as checking the existence of a case, to more complex reasoning tasks like analyzing the precedential relationship between two cases.

Paxton AI’s Testing Methodology

For our assessment, Paxton AI selected a representative random sample of 1,600 tasks from the comprehensive pool of 750,000 tasks available in the benchmark. This sample was strategically chosen to include examples from each category of tasks provided in the benchmark to maintain a balanced and comprehensive evaluation.

– Random Sampling: Tasks were randomly selected to cover different legal challenges presented in the benchmark.

– Evaluation Criteria: Paxton AI processed each task and the responses were evaluated for correctness, presence of hallucinations, and instances where the AI abstained from answering—a critical capability in scenarios where the AI must recognize the limits of its knowledge or the ambiguity of the question.

– Selection of tasks: The paper describes 14 tasks (four low-complexity, four moderate-complexity, and six high-complexity tasks). The published data file contains 8 tasks, the four low- and four moderate-complexity tasks, but those eight tasks are split further into a total of 11 distinct tasks. For example, the task ‘does this case exist?’ has two separate tasks for real and non-existent cases. Of the 11 tasks in the data file, we excluded the ‘quotation’ and ‘cited precedent’ tasks. These tasks were not included in our evaluation due to specific alignment considerations with our current testing framework. We aimed to ensure that all evaluated tasks seamlessly fit within the existing architecture of our system, providing a consistent and accurate assessment across the board. We also excluded the ‘year overruled’ task due to a discrepancy between the published dataset and the metadata Paxton relies on for answering questions.

Transparency and Accessibility

– Open Data: Committed to transparency and fostering trust within the legal and tech communities, Paxton AI is releasing the detailed results of our tests. This open data approach allows users, researchers, and other stakeholders to review our methodologies, results, and performance independently.

– Continuous Improvement: The insights gained from this benchmark testing are invaluable for ongoing improvements. They help us identify strengths and areas for enhancement in our AI models, ensuring that we remain responsive to the needs of the legal community and maintain the highest standards of accuracy and reliability.

The Stanford Legal Hallucination Benchmark serves as a critical tool in our efforts to validate and improve the performance of our legal AI technologies. By participating in such rigorous testing and sharing our findings openly, Paxton AI demonstrates its commitment to advancing the field of legal AI with integrity and scientific rigor.’

Paxton AI Data, July 2024

So, there you go.
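For readers who want a concrete sense of what the methodology described above involves, here is a rough sketch of a stratified-sample-and-score evaluation harness. To be clear, this is not Paxton’s code: the field names, the abstention heuristic and the exact-match scoring are all assumptions made purely for illustration of the general pattern (draw a balanced random sample across task categories, then label each response as correct, abstained or hallucinated).

```python
import random
from collections import defaultdict

# Hypothetical task records: each has a category, a question and a gold answer.
# Field names here are illustrative, not the published benchmark's schema.
def stratified_sample(tasks, total=1600, seed=0):
    """Draw a roughly equal number of tasks from each benchmark category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for task in tasks:
        by_category[task["category"]].append(task)
    per_category = total // len(by_category)
    sample = []
    for items in by_category.values():
        sample.extend(rng.sample(items, min(per_category, len(items))))
    return sample

def score(response, gold_answer):
    """Classify one response as correct, abstained or hallucinated."""
    if response is None or response.strip().lower() in {"i don't know", "unknown"}:
        return "abstained"      # the system recognised the limits of its knowledge
    if response.strip().lower() == gold_answer.strip().lower():
        return "correct"
    return "hallucinated"       # a confident but wrong answer

def evaluate(sampled_tasks, ask):
    """`ask` is whatever callable sends a question to the system under test."""
    counts = defaultdict(int)
    for task in sampled_tasks:
        counts[score(ask(task["question"]), task["gold_answer"])] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

A real harness would of course need more forgiving answer matching than plain string equality – citations, for instance, can be formatted several ways – but the overall stratify, sample and score shape is the same.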

As mentioned before, the more legal tech companies that are open and transparent – and, above all, show how they got their accuracy results – the better.

The challenge that the field of legal tech, and especially legal AI, has faced is the tendency for vendors to just say ‘we have 95% accuracy’ without putting that figure into any context.

Without knowing the measures used, how the tests were carried out, or the context of the use case – as every use case may have different elements that alter accuracy needs – we cannot know much about what such a figure really means.

That then means law firms and legal teams have to conduct their own detailed tests. This is fine for large firms with plenty of resources, but many law firms don’t have the time, staff, or free cash to carry out such scientific tests. In fact, even large firms perhaps have better things to do with their resources!

Moreover, it’s not unreasonable to expect a provider of a tech tool that will be used on high-risk legal matters to provide some assurances as to what it can do.

Plus, in the wake of the Stanford study, the legal market needs this reassurance if genAI is to be all it can be.

This site commends Paxton for their work.

P.S. The Confidence Indicator

Paxton is not the first legal AI company to provide a confidence indicator – some of the first wave of NLP / ML pioneers launched similar features a while ago. Even so, this is useful. Below, Paxton explains how it works:

‘This new feature is designed to enhance user experience by rating each answer with a specific confidence level—categorized as low, medium, or high. Additionally, it offers valuable suggestions for further research, guiding users on how they can delve deeper into the topics of interest and verify the details.

It is important to clarify that while large language models (LLMs) can generate confidence scores for their responses, these scores are not always indicative of the actual reliability or accuracy of the information provided. The Confidence Indicator in Paxton AI operates differently. Instead of relying solely on the model’s internal confidence score, our Confidence Indicator evaluates the response based on a comprehensive set of criteria, including the contextual relevance, the evidence provided, and the complexity of the query. This approach ensures that the confidence level assigned is a more accurate reflection of the response’s trustworthiness.

Confidence levels:

– Low: Paxton is unsure of its answer. This often happens when the query lacks context or involves an area that is not well-covered by the source materials. In this case, we recommend providing additional context to Paxton by providing more detail, checking the source materials to understand if they correspond to your query, and breaking down multi-part questions into smaller ones.

– Medium: Paxton has moderate confidence in the response. For less critical tasks or areas that have a lower need for accuracy, users can choose to rely on these responses. For higher confidence, users should follow up with more questions, provide additional context, or try breaking complex multi-part questions into smaller queries.

– High: Paxton is confident in its answer. While users should validate all AI-generated answers and read the citations provided, high confidence responses are of the highest quality and reliability.’
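Paxton does not say how the criteria it lists – contextual relevance, supporting evidence, query complexity – are weighted or combined, so the following is only a hypothetical sketch of the general pattern: score a response on each signal, blend the scores, then map the result to a band. Every field name, weight and threshold here is an assumption made for illustration, not Paxton’s implementation.

```python
from dataclasses import dataclass

@dataclass
class ResponseSignals:
    # All fields, weights and thresholds below are invented for illustration;
    # Paxton has not published how its Confidence Indicator is computed.
    contextual_relevance: float  # 0-1: how well the retrieved sources match the query
    evidence_support: float      # 0-1: how much of the answer is grounded in citations
    query_complexity: float      # 0-1: higher means a harder, multi-part question

def confidence_band(signals: ResponseSignals) -> str:
    """Map heuristic signals onto a low / medium / high confidence label."""
    score = (0.4 * signals.contextual_relevance
             + 0.4 * signals.evidence_support
             + 0.2 * (1.0 - signals.query_complexity))
    if score >= 0.75:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

# A well-grounded answer to a simple query lands in the 'high' band.
print(confidence_band(ResponseSignals(0.9, 0.85, 0.2)))  # -> high
```

Whatever the actual mechanics, the value of the banding is the behaviour it encourages: low-confidence answers prompt the user to add context or split the question, while even high-confidence answers still arrive with citations to check.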

Overall, great stuff. More transparency = more confidence in the technology, which then leads (hopefully) to more meaningful use of the tech at scale to really help change the business of law.