Beyond Accuracy to Understanding: Leveraging Stanford’s Hallucination Framework

By Tycho Orton.

In the wake of Stanford University's recent paper on AI-driven legal research tools, debate has erupted over the accuracy of these systems. However, this focus on accuracy misses the broader implications of the study. As we navigate the integration of artificial intelligence into the legal profession, we must look beyond the headline-grabbing statistics to the practical insights this research unveils.

For those unaware, the Stanford study examined leading AI tools from LexisNexis and Thomson Reuters. It revealed that these systems still produce significant rates of ‘hallucinations’ – false or misleading information.

This has led tech-optimists like myself and the author of GenAI Hallucinations? Lawyers Aren’t Perfect Either to ask what constitutes ‘good enough’ performance, whether from human lawyers or machine assistants. It’s tempting to defend new technologies by pointing out that humans, too, make mistakes. However, this argument has had limited success in increasing the adoption of other legal technology and will likely have similarly limited impact on generative AI adoption.

There are several reasons why comparing AI to human performance is problematic:

  1. Expectations differ significantly. We are accustomed to computers performing tasks consistently and to natural variance in human performance.
  2. There’s a psychological element at play. We enjoy teaching colleagues but find dealing with machine-generated mistakes frustrating.
  3. We’ve developed strategies for eliciting improved performance from humans but lack similar mechanisms for interacting with AI systems.
  4. We have an intuitive understanding of the types of errors humans typically make, allowing us to spot and correct human mistakes more readily. AI errors can be less predictable and potentially more insidious.

These factors combine to create a landscape where the mere comparison of error rates between humans and AI fails to capture the full picture. Scepticism of AI's usefulness is a combination of rational concern about the limitations of the technology, reluctance to incur the switching costs associated with adopting and learning to use AI, and emotional resistance.

The true value of Stanford’s research lies in two key areas:

  1. Establishing a framework for transparent benchmarking of legal AI
  2. Identifying the types of queries that are most likely to confound these systems

The first will ultimately make AI more reliable, easing the expectation and psychological barriers described above. The second provides a framework for understanding the causes of AI hallucinations and hence clues to how we can reduce them now – through prompting – and in the future – through AI training and tuning.

The Stanford study’s greatest contribution may be its typology of error causes in AI-generated legal information. By understanding the root causes of inaccuracies – whether they stem from naive retrieval, inapplicable authority, reasoning errors, or other factors – we can develop strategies for improvement. This knowledge empowers users to refine their prompts and interpret results more critically, potentially reducing the incidence of hallucinations in AI outputs.

For example, my colleagues and I have found that breaking down complex queries about articles of association into simpler, constituent questions helps overcome ‘Reasoning Errors’ – a common issue identified in the Stanford study, particularly for Westlaw and Practical Law AI tools. This approach of decomposing legal questions into multiple, simpler prompts can significantly improve AI outputs and sidestep reasoning limitations.
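To make the decomposition strategy concrete, here is a minimal illustrative sketch. It assumes a hypothetical `ask` function standing in for a call to any legal AI assistant – none of the function names or questions below come from the Stanford study or from a real tool's API.

```python
# Illustrative sketch of prompt decomposition. `ask` is a hypothetical
# stand-in for a call to a legal AI tool, not a real API.

def ask(prompt: str) -> str:
    # Placeholder: in practice this would send the prompt to the AI tool.
    return f"[answer to: {prompt}]"

def answer_complex_query(original_question: str, sub_questions: list[str]) -> str:
    # Instead of one compound prompt, pose each constituent question
    # separately, then assemble the answers for human review.
    answers = [ask(q) for q in sub_questions]
    combined = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in zip(sub_questions, answers)
    )
    return f"Original question: {original_question}\n\n{combined}"

# Example: a compound question about articles of association, broken
# into two simpler constituent questions (both invented for illustration).
result = answer_complex_query(
    "Can the directors refuse to register a share transfer, "
    "and what notice must they give?",
    [
        "Do the articles give directors a discretion to refuse to "
        "register a share transfer?",
        "If a transfer is refused, what notice must the company give "
        "the transferee?",
    ],
)
print(result)
```

Each simpler prompt gives the AI a single, well-scoped reasoning task, and the assembled output keeps the sub-questions visible so a lawyer can check each step rather than a single opaque conclusion.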

Naive Retrieval (e.g. grounding an answer on procedure in a negligence case on judicial review cases) and Inapplicable Authority (e.g. citing an overruled case) appear to be harder challenges to prompt around, though with some creative thinking there may be work-arounds. Pleasingly, Sycophancy (agreeing with a mistaken user) is low for all three AI tools tested.

Eventually, technology improvements should resolve these problems. Until then, information on AI tools’ limitations is key to successful use and adoption. As technology improves, lawyers must conduct ongoing evaluation and refinement of AI. This is crucial for commercial success and to ensure a clear understanding of the risks of the technology. The Stanford study provides a valuable starting point, but it’s crucial that we continue to develop robust frameworks for assessing these tools.

Benchmarks and frameworks such as that offered by Stanford should be embraced by legal tech customers and suppliers.

Customers should demand this sort of transparency because it:

  • Increases pressure to improve AI products
  • Allows compensation for an AI product’s gaps

Suppliers should embrace this transparency because it:

  • Provides an independent metric of value
  • Offers objective criteria for procurement teams to demonstrate value

Thomson Reuters and LexisNexis have both indicated that they are willing to work with others to develop these frameworks.

The Stanford study’s framework for evaluating legal AI and its typology of error causes offer invaluable tools for improving these systems. Legal professionals must embrace these insights to drive the responsible integration of AI into practice. All stakeholders in the legal industry – practitioners, tech suppliers, and researchers – should adopt and build upon this framework.

Practitioners should demand transparency from AI providers and use the error typology to refine their prompts and interpret results critically. Tech suppliers must leverage these benchmarks to enhance their products and demonstrate value objectively. Researchers should continue to develop and refine these evaluation methods.

By doing so, the legal profession can accelerate the development of more reliable AI tools, enhance understanding of their limitations, and deliver better outcomes for clients and the justice system. The future of legal AI depends on shaping it responsibly and effectively, using and improving the robust frameworks now available.

About the author: Tycho Orton is an associate in CMS’s Banking – Restructuring and Insolvency team, with a strong interest in legal tech. The views here are his own and not the firm’s.