An AI Engineer’s Personal Journey Into Law

By Dr William Becker, CTO, Lexical Labs

Why I love the law (and lawyers)

It has always been of interest to me how the job of a lawyer mirrors that of a software engineer. On the face of it they seem entirely different professions: the sharp-dressing besuited, serious explainers of a complex system of rules on the one hand, and ill-groomed, care-free implementers of a complex system of rules on the other.

But besides the social stigma we currently apply to each group, if you poke under the surface the two are actually very similar! Software, like the law, runs on a set of rules or commands, and each has its own language that is largely impenetrable to outsiders. To excel in either you need a keen mind that can cut to the heart of a difficult problem and come up with a solution that satisfies a number of requirements. Logic, the ability to follow long trains of interrelated ideas and sheer bloody-minded determination are all key characteristics.

So when it comes to replicating the mind of a lawyer, sometimes it takes someone who is close but distant to do the best job.

We have been trying to understand the legal process for years at Lexical Labs, and in this whitepaper we hope to explain how we see the basic tasks of the legal trade when it comes to contract review and negotiation, and how they can be modelled and automated.

As a summary, there are several key parts to reviewing a contract:

  1. Comprehending what it says
  2. Relating that to the requirements of the parties
  3. Understanding how to remediate the contract

Let’s get into the details of how computers can be made to do these things:


When a lawyer reads a contract, they read it at multiple levels.

  1. A contract is literally on a computer screen or a piece of paper. You need to convert those pixels or ink blotches into letters and words and lines and clauses. This is relatively independent of the legal profession, but it still often needs to be done by a computer. Software known as optical character recognition (OCR) attempts to do this. It’s not always the simplest thing, and can be harder for contracts.

For example, legal documents have random-seeming (to the non-lawyer at least) letters, digits or roman numerals interspersed throughout as items in lists. These may be within a single paragraph, or split into separate lines. The digit ‘1’ looks very much like a lower case ‘L’ or an uppercase ‘i’.

To distinguish between them takes a contextual view of the nearby text, which a human can do (not without some effort sometimes!), but which does not come standard in out-of-the-box OCR software. Is that ‘i’ the first item in a list that is in roman numerals, or the 9th item in a list of letters?

We’ve seen cases where two ‘i’s follow an ‘h’, because the 9th item in the alphabetical list had a subordinate list underneath it which used roman numerals. It is relatively easy if you are reading a Word document that understands the “level” of the list, but harder when you are reading it directly from the paper.
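To make the ambiguity concrete, here is a minimal sketch of one way to resolve list markers from context once OCR has produced a flat sequence of them. It is a heuristic of our own for illustration, not a description of any particular OCR product: the function name `label_markers` and the rule "prefer continuing an open list over opening a new one" are assumptions.

```python
import string

LETTERS = list(string.ascii_lowercase)
ROMANS = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"]
SEQUENCES = {"letter": LETTERS, "roman": ROMANS}


def label_markers(markers):
    """Label each marker in a flat sequence as (marker, kind, depth)."""
    stack = []  # one (kind, position) pair per currently open list
    labelled = []
    for marker in markers:
        # 1. Does the marker continue the innermost open list or, failing
        #    that, one of its ancestors? Search from the innermost outwards.
        for depth in range(len(stack) - 1, -1, -1):
            kind, position = stack[depth]
            sequence = SEQUENCES[kind]
            if position + 1 < len(sequence) and sequence[position + 1] == marker:
                del stack[depth + 1:]  # close any deeper lists
                stack[depth] = (kind, position + 1)
                break
        else:
            # 2. Otherwise it can only be the first item of a new nested list.
            kind = next((k for k, seq in SEQUENCES.items() if seq[0] == marker), None)
            if kind is None:
                raise ValueError(f"cannot place marker {marker!r}")
            stack.append((kind, 0))
        labelled.append((marker, stack[-1][0], len(stack) - 1))
    return labelled


result = label_markers(list("abcdefgh") + ["i", "i", "ii", "j"])
```

In this example the first ‘i’ is read as the 9th letter, the second ‘i’ as roman numeral one in a nested list, and ‘j’ closes the nested list again. Real documents need more than this (digits, skipped items, OCR misreads), so treat it as a starting point only.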

  2. A lawyer understands that a contract is not just 100-odd pages of dense legalese; it has an overriding structure – a front page, a table of contents, some information about the parties, a set of recitals, a set of (hopefully) well-numbered clauses, and maybe some commercial terms in some tables or further related information as schedules or appendices. All this needs to be read independently, yet in context with the rest of the agreement. A reference from clause 2 to clause 3 in a schedule may be referring to the schedule’s clause 3 or the one in the main agreement, depending on the situation. Terms that are defined could be written in their own schedule or as a separate clause in the main agreement. For a system to understand a contract, it must understand these nuances.
  3. The beauty of a contract is that it is not just free-form text, but a set of well-thought-out, terse clauses written in a hierarchical form. This means that a sub-clause or sentence cannot necessarily be read in isolation, but may be a fragment in a list that has some preamble 20 lines above, or more text that completes it below. It may use a capitalised term that may or may not be defined elsewhere. It may have a reference to legislation or regulations, or a reference to another clause or indeed an entire section of the document.

You could thus compare the structure of a contract to a tree, with each ordered clause being a branch, sub-clauses being further branches and each sentence being a leaf.

Trees are a common data structure in computer science, and being able to read a contract in such a way has many benefits – you can easily understand the context of a sentence (e.g. which sentences start and end a sub-clause), refer to sentences (even if they aren’t numbered) and see which clauses are close by (very useful for resolving more abstract clause references such as “this clause” or “the preceding section”).
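As a rough illustration of the tree idea, the sketch below models clauses as nodes that know their parent and children. The class name `Node` and its methods are hypothetical names chosen for this example, and the clause text is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by content
class Node:
    label: str  # e.g. "2", "(a)", or "" for an unnumbered sentence
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, label, text=""):
        """Attach a new child clause or sentence and return it."""
        child = Node(label, text, parent=self)
        self.children.append(child)
        return child

    def path(self):
        """Labels from the root down to this node."""
        labels, node = [], self
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return list(reversed(labels))

    def context(self):
        """Text of every ancestor, outermost first (the preamble)."""
        texts, node = [], self.parent
        while node is not None:
            if node.text:
                texts.append(node.text)
            node = node.parent
        return list(reversed(texts))

    def preceding_sibling(self):
        """The node just before this one at the same level, if any."""
        if self.parent is None:
            return None
        siblings = self.parent.children
        index = siblings.index(self)
        return siblings[index - 1] if index > 0 else None


contract = Node("contract")
clause_2 = contract.add("2", "The Supplier must:")
item_a = clause_2.add("(a)", "deliver the Goods; and")
item_b = clause_2.add("(b)", "comply with the preceding paragraph.")
```

Here `item_a.context()` recovers the preamble "The Supplier must:" that completes the fragment, and `item_b.preceding_sibling()` gives a candidate target for a reference like "the preceding paragraph".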

  4. Once you understand how the clauses of the contract are laid out, a person reading a contract needs to know what content those clauses actually contain. Clauses in a contract are often very similar from one contract to another, if not in actual wording, then at least in meaning. Two contracts trying to achieve the same goal may not use clauses that are actually written the same way or in the same order, but may mean the same thing. So our next goal is to get computers to understand this similarity. There are two ways of doing this: from a position of understanding or from a position of ignorance. Both of these have their uses.

Unsupervised approaches

It is interesting to understand how a computer which does not know anything about a contract could understand similarity between clauses that have no actual textual overlap. This is achieved by the mechanism of clustering. Clustering groups similar items together, e.g. if you had a bucket of balls you could group them together by size or colour or material. Tennis balls are of a certain size and tend to be green, and have a felt surface – but you can also get giant novelty tennis balls, and purple tennis balls. Footballs and volleyballs tend to be of a similar size, and are differentiated by their stitching.

The more balls you have, the more detail you can go into – if you only have 10 balls you might just care about the kind of ball you have, but if you have 100,000 balls you may well be more discerning about them, for example, grouping tennis balls into ones designed for clay versus grass courts.

Similarly, you can group sentences together using different methods. A simple attempt might be to group sentences with words that overlap. More complex approaches might still use words but also look at their role in the sentence (whether they are subjects or objects, or sit in subordinate clauses). However, this won’t help when synonyms are used. Like an ignorant human, a computer can consult a thesaurus (e.g. WordNet) and understand whether two words are similar or not.
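The word-overlap idea can be sketched in a few lines. This example uses the Jaccard measure (shared words divided by all words) and a tiny hand-written synonym table as a stand-in for a real thesaurus; the table and function names are made up for illustration.

```python
import re

# Hypothetical synonym table standing in for a real thesaurus such as WordNet.
SYNONYMS = {"vendor": "supplier", "terminate": "end", "cease": "end"}


def tokens(sentence):
    """Lower-case words of a sentence, with known synonyms merged."""
    found = re.findall(r"[a-z]+", sentence.lower())
    return {SYNONYMS.get(word, word) for word in found}


def jaccard(first, second):
    """Share of distinct words the two sentences have in common (0 to 1)."""
    a, b = tokens(first), tokens(second)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


same_meaning = jaccard("The vendor may terminate", "The supplier may end")
unrelated = jaccard("governing law", "payment terms")
```

Without the synonym table the first pair would share only "the" and "may"; with it, the two sentences are treated as identical.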

More recent technological breakthroughs such as Word Vectors turn a word into a long series of numbers. Words that have similar numbers are considered to be closely related, e.g. “king” and “queen” have similar numbers, as do “son” and “daughter”.
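"Similar numbers" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below uses made-up three-number vectors purely for illustration; real word vectors are learned from large amounts of text and typically have hundreds of dimensions.

```python
import math

# Invented toy vectors, not the output of any real model.
VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


royal = cosine(VECTORS["king"], VECTORS["queen"])
mixed = cosine(VECTORS["king"], VECTORS["invoice"])
```

With these toy numbers `royal` comes out far higher than `mixed`, which is the behaviour you would hope for from real vectors too.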

Using this approach you could take a number of similar documents and sort their sentences into groups of similar clauses. You could even attempt to name these groups by using the most distinctive words in each group (the idea being that if a word is used in one group but not another it should be a good signifier of the group’s lexical intent).

Then when you see new documents you could look at each sentence and see what group it most resembles and thus “classify” a document.
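Putting those three steps together (group, name, classify), a toy version might look like the following. It uses a very simple greedy clustering on word overlap rather than any production algorithm, and the threshold, function names and example sentences are all assumptions made for this sketch.

```python
import re
from collections import Counter


def tokens(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))


def overlap(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(sentences, threshold=0.25):
    """Greedy one-pass clustering: join the first group whose first
    sentence is similar enough, otherwise start a new group."""
    groups = []
    for sentence in sentences:
        for group in groups:
            if overlap(tokens(sentence), tokens(group[0])) >= threshold:
                group.append(sentence)
                break
        else:
            groups.append([sentence])
    return groups


def name_group(group, others, size=2):
    """Most frequent words used in this group but in no other group."""
    other_words = set()
    for other in others:
        for sentence in other:
            other_words |= tokens(sentence)
    counts = Counter(w for s in group for w in tokens(s) if w not in other_words)
    return sorted(counts, key=lambda w: (-counts[w], w))[:size]


def classify(sentence, groups):
    """Index of the group containing the most similar known sentence."""
    scores = [max(overlap(tokens(sentence), tokens(s)) for s in g) for g in groups]
    return scores.index(max(scores))


groups = cluster([
    "Customer may terminate on thirty days notice",
    "Customer may terminate for material breach",
    "Supplier shall indemnify Customer against all losses",
    "Supplier shall indemnify Customer against third party claims",
])
first_name = name_group(groups[0], groups[1:])
new_group = classify("Customer may terminate for convenience", groups)
```

The computer never learns what "termination" means here; it only notices that certain words keep turning up together, which is exactly the "position of ignorance" described above.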

Supervised approaches

If you have some legal nous, you might be inclined to apply some of it to the problem directly. Instead of having a computer make groups, you could make them. You could find a sentence in a contract that is a key area of import, and repeat the process in a few other documents. You could then tell your computer that these sentences all achieve a similar purpose and tell it that if it sees another one, it should apply some lawyer-assigned label.

This is known as “training a classifier”. Many kinds of classifiers exist and do things with increasing levels of subtlety. A basic one might just see if the words are similar without regard for the order (similar to our basic clustering algorithm above). The state of the art now looks at the order of each word in relation to every other word and the part of speech of each word, and uses Word Vectors to calculate similarity of meaning.
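For a feel of what the basic, order-blind kind of classifier looks like, here is a minimal sketch of a bag-of-words Naive Bayes classifier trained on a handful of lawyer-labelled sentences. It is one classic technique chosen for illustration, not necessarily what any given product uses; the class name, labels and training sentences are invented.

```python
import math
import re
from collections import Counter, defaultdict


def words(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def fit(self, examples):
        """examples: (sentence, label) pairs, labels assigned by a lawyer."""
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for sentence, label in examples:
            self.label_counts[label] += 1
            for word in words(sentence):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, sentence):
        """Return the label with the highest log-probability."""
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            # Add-one smoothing so unseen words do not zero out a label.
            denominator = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words(sentence):
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit([
    ("Either party may terminate this agreement on 30 days notice", "termination"),
    ("The customer may terminate for material breach", "termination"),
    ("The supplier shall indemnify the customer against all claims", "indemnity"),
    ("Each party indemnifies the other against third party losses", "indemnity"),
])
prediction = model.predict("The supplier may terminate the agreement immediately")
```

Four examples are far too few for real use, but the shape of the process is the same at any scale: label some sentences, fit the model, then ask it about sentences it has not seen.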

Making it real

So let’s say we now comprehend what clauses are in a contract. We can now ask whether each one is in our favour or not. Certain types of clauses may be unfavourable just by their existence, and we may want them removed. Other clauses we might always want included. Most of the time, though, there will be some level of tolerance, and it must be determined whether a clause is favourable or not.

We may have trained our classifiers to find a clause only when it is unfavourable to us. This approach does not transfer well. A clause may be intolerable only in one kind of contract, or with one category of vendor, so if we have trained the classifier to trigger only when the clause is unfavourable, everywhere else we won’t know whether the clause exists at all. So it’s better (and easier!) to train a classifier to find a kind of clause generally, then specify in another way whether it is actually acceptable or not.

A clause may be unacceptable for a number of reasons: it may be well drafted but a time period in it may be too long or a liability cap too high. It might be fine as an obligation for a vendor but not for a customer. It could be good (or bad) if there is another clause also in the contract. It could be a problem depending upon the scope of a referenced defined term. It may happen that it is agreeable as a right, but not an obligation.

The solution then needs to be matched to the particular issue you have. A time period can be extracted from the contract and you can set up acceptable bounds for it. Through Natural Language Processing techniques you can pick up who is giving the obligation and relate them to the parties of the agreement (whether it is by party name or the role of the party). A legal tech expert (who straddles the worlds of legal knowledge and AI technology) can then easily put these rules together behind a set of issues that describe the requirements of the user.
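As an illustration of keeping "what kind of clause is this" separate from "is it acceptable", the sketch below pulls a time period out of a clause that has already been identified and checks it against bounds supplied by the user. The regular expression, the approximate day conversions and the function names are assumptions made for this example.

```python
import re

# Approximate conversions, good enough for a bounds check.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}

# Matches "90 days", "6 months", and the common "thirty (30) days" style.
PERIOD = re.compile(r"(\d+)\)?\s+(day|week|month|year)s?\b", re.IGNORECASE)


def extract_period_days(clause):
    """Return the first time period found in the clause, in days, or None."""
    match = PERIOD.search(clause)
    if match is None:
        return None
    return int(match.group(1)) * UNIT_DAYS[match.group(2).lower()]


def check_period(clause, min_days, max_days):
    """Compare the extracted period with the user's acceptable bounds."""
    days = extract_period_days(clause)
    if days is None:
        return "no period found"
    return "acceptable" if min_days <= days <= max_days else "issue"


verdict = check_period("Payment is due within 90 days of invoice.", 0, 60)
```

Because the bounds are passed in rather than baked into the classifier, the same clause detector can serve one policy for vendors and another for customers.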


Finally, once you understand that there is a problem with a part of a contract, how do you then amend the contract in a mutually acceptable way? There are a few ways to do this, from simple process-based methods to approaches using modern AI.

The simplest – and least technical – way is to have a set of favoured language and acceptable fallback positions prepared in advance. When an issue is triggered, you can then explain what the precise problem is and provide text that the user would then need to customise for the contract.

A slightly more intelligent route would be to make this text configurable so that key terms or amounts could be manually or automatically synchronised so they match the contract.
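A configurable fallback can be as simple as a template with named gaps. The sketch below uses Python’s standard `string.Template`; the issue name, the wording of the fallback and the `suggest` helper are hypothetical.

```python
from string import Template

# Hypothetical library of preferred wording, keyed by issue name.
FALLBACKS = {
    "payment_terms": Template(
        "$payer shall pay each undisputed invoice within $days days of receipt."
    ),
}


def suggest(issue, **terms):
    """Fill the preferred wording for an issue with terms from the contract."""
    return FALLBACKS[issue].substitute(**terms)


suggestion = suggest("payment_terms", payer="the Customer", days=30)
```

The values for `payer` and `days` could be typed in by the user or pulled automatically from the defined terms and extracted amounts of the contract under review.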

Given a bank of already negotiated contracts, you could mine this data to identify commonly negotiated clauses. If a clause in a new contract resembles the original text of one of them, the negotiated outcome can then be offered as a suggestion.
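A minimal sketch of that lookup, assuming the bank is stored as pairs of original and negotiated text, could use the standard library’s `difflib` to measure resemblance. The bank contents, the 0.8 threshold and the function name are invented for illustration; a real system would more likely compare meaning than characters.

```python
from difflib import SequenceMatcher

# Hypothetical bank of (original text, negotiated outcome) pairs.
BANK = [
    ("The Supplier's liability is unlimited.",
     "The Supplier's liability is capped at the fees paid in the preceding 12 months."),
    ("Payment is due within 90 days of invoice.",
     "Payment is due within 30 days of invoice."),
]


def suggest_from_bank(clause, bank=BANK, threshold=0.8):
    """Return the negotiated outcome of the most similar original clause,
    or None if nothing in the bank is similar enough."""
    best_ratio, best_outcome = 0.0, None
    for original, outcome in bank:
        ratio = SequenceMatcher(None, clause.lower(), original.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_outcome = ratio, outcome
    return best_outcome if best_ratio >= threshold else None


suggestion = suggest_from_bank("Payment is due within 60 days of invoice.")
```

The threshold matters: set it too low and unrelated clauses receive confident-looking suggestions, so it is safer to return nothing than a poor match.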

More recent technological developments, such as generative models like GPT from OpenAI, let you train a model on all of your previously worded clauses and ask it to generate text for a given position. It’s not yet perfect for generating complete legalese (it is much better at free-wheeling prose), but it is a glimpse into the future.


As you can see, reviewing a contract is quite an involved process, but there are approaches to emulate each step, and combining them could give you a lawyer in a bottle. Of course, a lawyer isn’t just someone who reviews contracts, but someone who also understands what the general position of a client is and what is special about a specific circumstance, and who has a wealth of experience with which to guide clients and protect them from pitfalls.

We can use an understanding of the contract review process to do the heavy lifting on much of the standard legal work, and allow lawyers (and other people in an organisation who regularly deal with contracts) to achieve their ends in a fraction of the time.

If you would like to learn more about Lexical Labs and what it can do for you, then please see here.

[ Artificial Lawyer is proud to bring you this sponsored thought leadership article by Lexical Labs. ]