‘Language in the Age of Machine Learning’
By Johannes Stiehler, CTO of text analytics company Ayfie.
Language is a key property of being human.
Most human interactions either centre on language or are at least accompanied and formalized by it. Consequently, most knowledge is still stored in the form of human language, which makes it hard for machines to access.
Ever since the rise of computers, accessing and producing language has been a key area of research and ambition for computer science: from the famous Eliza, to the promises of expert systems, to the golden age of search engines (essentially finding language using language), and now to the advent of ubiquitous digital assistants, whose defining feature is that they offer exclusively language-based interaction.
Making human language accessible to machines has long been studied in academia and industry, because a big promise is attached to solving this problem: the promise of humans and computers interacting at eye level.
Language in eDiscovery and Document Review
eDiscovery/document review processes and knowledge management applications such as insight engines are at their core predominantly language-driven.
However, most applications treat this language as if it were simply a sequence of unconnected words.
Brute-force statistics and machine learning algorithms are then applied to this string of characters, hoping that the algorithm is good enough to magically create the expected outcome. Since language is not a sequence of unconnected words, this approach usually yields suboptimal results.
Consider this simple sentence:
‘Robert Keating requested a meeting with the London department tomorrow.’
To a simple machine learning algorithm (deep learning aside), this sentence would look like this:
‘robert | keating | requested | a | meeting | with | the | london | department | tomorrow’
Thus, the word “requested” would initially carry the same weight as the word “department”, and the words “robert” and “keating” would be juxtaposed but no more related to each other than “department” and “tomorrow”.
But in reality, some of these words form larger semantic entities, while others contribute (almost) nothing to the meaning of the sentence. Hence, to the casual glance of a human, the key elements of this sentence look more like this:
‘robert keating | meeting | london department | tomorrow’
The key semantic entities or concepts summarize the content of this sentence into the fundamental who, what, where, when.
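The contrast between the two representations can be sketched in a few lines of Python. The tokenizer below is a minimal stand-in for a real one, and the concept list is hand-annotated for illustration, not the output of an actual natural language pipeline:

```python
import re

sentence = "Robert Keating requested a meeting with the London department tomorrow."

# Naive "bag of words": lowercase and split into single tokens.
# Grouping into semantic units is lost, and function words survive.
tokens = re.findall(r"[a-z]+", sentence.lower())
print(tokens)
# → ['robert', 'keating', 'requested', 'a', 'meeting', 'with',
#    'the', 'london', 'department', 'tomorrow']

# A concept-level view (hand-annotated here) keeps multi-word
# semantic entities together and drops the filler words.
concepts = ["robert keating", "meeting", "london department", "tomorrow"]
print(concepts)
```

Each concept answers one of the who, what, where, when questions directly, whereas the token list leaves that structure for the algorithm to rediscover.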
It is obvious why a human who can quickly grasp these underlying semantic “highlights” is much better at understanding and summarizing the contents of texts than any software available.
Language and machine learning
This is why we believe text analysis and machine learning algorithms should not be run on some “bag of words” representation of the text, but instead on the key semantic concepts as extracted by natural language technology. This increases the quality of the output much more than just switching to a different machine learning algorithm.
In one experiment we ran together with a customer on automated text classification, switching to “support vector machines” as the classification algorithm improved the outcome by three percentage points. However, feeding that algorithm extracted semantic concepts instead of single words or arbitrary bigrams (two-word sequences) improved the quality of the output by 48%.
Let me give you another example of why this is the case. The goal of automated text classification is to sort documents (or other pieces of text) into predefined buckets or categories.
This is achieved by first manually labelling some documents with their category (e.g. “energy“, “health,” “travel,” “politics”) and then using those to train a classification algorithm.
The algorithm learns the similarities and differences in the text for each category and can thus automatically apply the same categories to unknown documents with adequate precision, as long as enough training documents have been labelled for each category.
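The training-then-classification loop can be sketched with a toy nearest-centroid classifier. The categories, documents, and word-count features below are invented for illustration; a production system would use a proper algorithm (such as the support vector machines mentioned above) and, as argued here, concept features rather than raw words:

```python
from collections import Counter
import math

def features(text):
    # Single-word features, i.e. the plain "bag of words" baseline.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hand-labelled training documents (invented examples).
training = [
    ("energy", "solar panels cut energy costs for power grids"),
    ("energy", "wind turbines generate renewable power"),
    ("travel", "book cheap flights and hotels for your vacation"),
    ("travel", "a travel guide to islands and beaches"),
]

# Merge each category's documents into one feature vector ("centroid").
centroids = {}
for label, text in training:
    centroids.setdefault(label, Counter()).update(features(text))

def classify(text):
    # Assign the category whose centroid is most similar.
    vec = features(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

print(classify("renewable energy from wind power"))  # → energy
```

Swapping the `features` function for one that emits extracted concepts instead of single words is exactly the change the experiment above measured.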
As in the examples above, a simple classifier would look at the individual words in each document, i.e. use them as so-called “features” for the algorithm.
Now, consider the following mini-documents:
1. We went hiking on Susan Island.
2. Susan Island issued a press release on Sunday.
3. Susan dreamt of a vacation on an island.
If the classifier were to use the simple “bag of words” approach, all three documents would obviously have two prominent words in common – “Susan” and “Island” – while semantically they have almost no overlap.
Even when using (arbitrary) bigrams or simple entity extraction, documents one and two would still be considered closer to each other than to document three. In reality, if anything, documents one and three are closer to each other than either is to document two, because they at least both deal with islands (and leisure activities).
Only if we base the machine learning algorithm on the semantic structure of the document can it pick up that Susan Island (GEO) is fundamentally different from Susan Island (PERS), and is instead a specialization of the “island” concept in document three.
Even in the era of machine learning, the problem of representing language in a way that is fully accessible to computers has not been solved yet.
For instance, modern digital assistants are a far cry from “understanding” language. By means of semantic analysis, we can elevate further processing of the text to a “higher level” of abstraction, thus making it both more robust and more precise.
As the late British linguist J. R. Firth (1890-1960) put it: “You shall know a word by the company it keeps.”