What Legal AI Benchmarks Reveal That Model Names Don’t

By Daniel Lewis, CEO, LegalOn.

Foundation models are improving quickly. One useful measure is software engineering: the length of coding tasks that frontier models can complete is now doubling roughly every four months.

That pace of change matters for legal AI. But general model progress doesn’t tell legal teams everything they need to know about how these models perform on specialized legal work. Contract review, in particular, turns on precise language, thresholds, cross-references, missing terms, and multi-part standards.

That is why LegalOn has released the 2026 Contract Review Benchmark, a deep evaluation of how leading AI models perform on contract review. The benchmark tested 11 AI models across 3,282 head-to-head reviews and 21 precision-critical guidelines. We compared the models in their raw form with how they perform inside LegalOn’s harness, a structured system engineered specifically for in-house legal work and built on top of the foundation models.

We don’t expect buyers or users of legal technology to follow model releases as closely as we do. Nor do we expect people to accept vendor benchmarks without some skepticism. The question is whether the benchmark measures something real, in a way that helps legal teams understand what AI can and cannot do.

We think this benchmark highlights four things worth caring about.

First: leading models still fail, on their own, on important contract review tasks.

The provisions tested were not obscure. They included assignment rights, PHI ownership language, NDA purpose clauses, SOW incorporation requirements and manuscript review timelines. These are common contract review issues. They are also the kind of issues where a wrong answer can create real legal or business risk.

We consistently found that general-purpose models often identified the right topic but missed the legal standard.

Finding an assignment clause is not enough if the guideline requires an unconditional assignment right with no consent requirement. PHI handling obligations are not the same as an express PHI ownership acknowledgment. A manuscript review right does not satisfy a guideline if the review period is too short. A provision that satisfies one part of a two-part requirement does not satisfy both.

This is why contract review is a hard AI task. The model cannot just sound fluent. It has to determine whether the contract meets the standard. In many cases, the relevant failure is not what the contract says, but what it doesn’t say.

Second: the harness around the model matters. A lot.

It is natural to ask which foundation model a legal AI product uses. But that question is incomplete. The same model can perform very differently depending on how the product uses it. Think of it this way: if the model is the engine, the harness is the chassis, the drivetrain, and the dashboard. It is the software that wraps around a raw AI model, transforming an unpredictable chatbot into a dependable worker capable of executing complex tasks.

A general-purpose model reviewing a full contract in one broad pass is doing a different task than a system that breaks the review into structured, provision-level checks. LegalOn’s harness is built around that second approach. Each check is tied to a specific guideline and a specific part of the contract.

This matters because contract review is actually many small tasks running together. Is the clause present? Is the required statement included? Is the number within the acceptable range? Are both conditions met? Does the SOW incorporate the MSA? Is the missing language actually missing?
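To make the idea of provision-level checks concrete, here is a minimal Python sketch. It is an illustration only, not LegalOn’s implementation: the `Check` class, the helper functions, the keyword matching and the 30-day threshold are all hypothetical stand-ins. In a real system each condition would more likely be a model call scoped to one provision and one guideline.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

Condition = Callable[[str], bool]


@dataclass
class Check:
    """One guideline applied to one provision (hypothetical structure)."""
    guideline: str
    conditions: List[Condition]  # every condition must hold for MET

    def evaluate(self, provision_text: str) -> str:
        # A multi-part standard is MET only if all of its parts are satisfied.
        ok = all(cond(provision_text) for cond in self.conditions)
        return "MET" if ok else "UNMET"


def mentions(phrase: str) -> Condition:
    """Toy presence test: the provision contains the phrase."""
    return lambda text: phrase.lower() in text.lower()


def lacks(phrase: str) -> Condition:
    """Toy absence test: the provision does not contain the phrase."""
    return lambda text: phrase.lower() not in text.lower()


def review_period_at_least(min_days: int) -> Condition:
    """Toy threshold test: the first 'N days' figure is at least min_days."""
    def cond(text: str) -> bool:
        match = re.search(r"(\d+)\s+days?", text.lower())
        return match is not None and int(match.group(1)) >= min_days
    return cond


# Two-part standard: an assignment right AND no consent requirement.
assignment = Check(
    guideline="Unconditional assignment right with no consent requirement",
    conditions=[mentions("may assign"), lacks("consent")],
)

# Threshold standard: the 30-day minimum is an assumed figure.
manuscript = Check(
    guideline="Manuscript review period of at least 30 days",
    conditions=[mentions("manuscript"), review_period_at_least(30)],
)

with_consent = "Either party may assign this Agreement with the prior written consent of the other party."
unrestricted = "Either party may assign this Agreement without restriction."
short_review = "Sponsor may review any manuscript for 15 days before submission."

print(assignment.evaluate(with_consent))   # right topic, wrong standard
print(assignment.evaluate(unrestricted))
print(manuscript.evaluate(short_review))
```

The first clause is on the right topic but fails the second half of the standard, which is the pattern described above: finding the clause is not the same as meeting the guideline.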

A good legal AI system has to organize the work in a way that matches how legal review is done. The model matters, but the harness matters too. For contract review, that can be the difference between a system that is impressive in a demo and a system that is reliable enough for daily legal work.

Third: LegalOn’s investment in that system showed up in the results.

LegalOn ranked first across all 21 provision types. Our Elo score was 87 points above the next closest model and more than 400 points above the best GPT model tested. LegalOn’s confidence interval did not overlap with that of any tested model, indicating a statistically reliable performance gap.
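For readers unfamiliar with Elo ratings, the sketch below shows the standard Elo formulas for turning a rating gap into an expected head-to-head score, and for updating ratings after one comparison. The 400-point scale and the K-factor of 32 are the conventional defaults and are assumptions here. The benchmark’s exact rating procedure, including how ties were scored, is not described in this article.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 for a win by A, 0.5 for a tie, 0.0 for a loss.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Illustrative ratings only; just the gaps (87 and 400 points) come from the text.
gap_87 = expected_score(1587, 1500)
gap_400 = expected_score(1900, 1500)
print(round(gap_87, 2), round(gap_400, 2))
```

Under these standard assumptions, an 87-point gap corresponds to an expected score of roughly 0.62 in a head-to-head comparison, and a 400-point gap to roughly 0.91.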

The benchmark also measured speed. LegalOn completed a full review in 2.3 seconds. Claude Opus 4.6, the strongest general-purpose model tested on speed, averaged 40.4 seconds per contract.

We’re proud of those results, but we’re not surprised by them.

Much of the work in legal AI is not visible in a product screenshot. It is the legal content, review architecture, evaluation design, model orchestration, product decisions and repeated testing that sit behind the interface. We’ve spent a great deal of time building that system for contract review. The benchmark gives one measure of the result.

Fourth: this is the most current and robust benchmark for contract review.

We believe this evaluation is the most robust and current public benchmark available for contract review. It was designed to test legal work at the level where mistakes actually happen: the provision.

For each contract and provision, two reviews were run side by side: one from LegalOn and one from a general-purpose AI model. The baseline models received the full contract and all guidelines at once, returning MET or UNMET determinations in a single pass.

An independent LLM judge, separate from the models tested and blind to authorship, assessed which review was more accurate, complete and useful. The judging criteria included correctness, evidence quality, article identification, completeness and reasoning quality.

To control for position bias, every comparison was run twice with the order reversed. A result counted as a win only when the same system was preferred in both orderings. If the preference flipped, the result was treated as a tie. Legal experts also validated a sample of the judge’s outputs against professional legal standards.
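The order-reversal rule can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark’s code: the `judge` callable and its `"first"` / `"second"` return values are hypothetical, and the two stub judges exist only to show the behavior.

```python
from typing import Callable

# A judge sees two reviews in order and says which slot it prefers.
Judge = Callable[[str, str], str]  # returns "first" or "second"


def debiased_outcome(review_a: str, review_b: str, judge: Judge) -> str:
    """Run the comparison in both orders; count a win only if both agree."""
    a_wins_when_first = judge(review_a, review_b) == "first"
    a_wins_when_second = judge(review_b, review_a) == "second"

    if a_wins_when_first and a_wins_when_second:
        return "a"
    if not a_wins_when_first and not a_wins_when_second:
        return "b"
    return "tie"  # the preference flipped with the ordering


def always_first(first: str, second: str) -> str:
    """Stub judge with pure position bias: always prefers the first slot."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Stub judge with a stable, order-independent preference."""
    return "first" if len(first) >= len(second) else "second"


detailed = "UNMET: Section 12 requires prior written consent to assign."
terse = "MET"

print(debiased_outcome(detailed, terse, always_first))    # bias cancels out
print(debiased_outcome(detailed, terse, prefers_longer))  # consistent preference
```

A judge that simply favors whichever review it sees first produces a tie under this rule, so position bias alone cannot generate a win for either system.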

No benchmark can replace a buyer testing a system on its own contracts, playbooks and risk standards. But a serious benchmark can still be useful. It can show where general-purpose models perform well, where they fail, and what kinds of product architecture are needed to turn model capability into dependable legal work.

Final thought

Foundation models will continue to improve, but legal AI should be evaluated on legal tasks, not only on general model reputation. In contract review, the relevant question is not just which model is underneath the product. It is how the product performs as a complete system, harness included, and whether it can reliably apply a legal standard to the details of a contract.

We hope the 2026 Contract Review Benchmark provides useful insight for anyone interested in how today’s models perform on legal work. 

[ This is a sponsored thought leadership article by LegalOn for Artificial Lawyer. ]

