The Linguistics of Natural Language Processing

Caleb Clothier
May 10, 2021


Introduction

In late 2016, exactly ten years after the launch of Google Translate, Google quietly announced that it would be replacing Translate’s original algorithm, Phrase-Based Machine Translation, with a new translation system dubbed Google Neural Machine Translation. The performance increase that ensued was sudden and drastic, characterized by the New York Times Magazine as “roughly equal to the total gains the old [system] had accrued over its entire lifetime.”[6]

The previous system, like many other machine translation systems at the time, had used statistical methods to match phrases in bilingual corpora. The new system, on the other hand, made use of neural networks, a class of statistical models conceived in the 1950s whose popularity exploded in the late 2000s as they achieved state-of-the-art performance on countless machine vision, audio processing, and natural-language processing benchmarks. But to fully contextualize the revolutionary improvements in natural-language processing ushered in by neural networks, we must begin with a brief history of computers and language.

A Brief History of NLP

Alan Turing, a foundational figure in the field of computer science, was among the first academics to seriously consider whether machines are capable of human-like thought, proposing the famous Turing Test in 1950 as a criterion for machine intelligence: a computer passes the test if, through real-time conversation, it can convince human judges that it is human. The Turing Test has various shortcomings as an arbiter of machine intelligence; Turing himself acknowledged the vagueness of the question of whether a machine can think, and instead asked whether a machine can imitate human behavior. Nonetheless, Turing’s choice of human language as the testing domain highlights the seemingly insurmountable difficulty of teaching computers to understand and generate language.

The first early success in creating computational representations of human language came when Noam Chomsky published “Syntactic Structures” in 1957, revolutionizing the field of linguistics with his model of transformational generative grammar. In the work, Chomsky sought to mathematically formalize the high-level structures of language, in particular syntax: how words can be arranged into sentences that are grammatical, or by Chomsky’s definition, “intuitively acceptable to a native speaker.” After establishing that the grammaticality of a sentence is independent of its semantic content (offering the grammatical but meaningless sentence “colorless green ideas sleep furiously” as an example), Chomsky proposed a finite set of symbolic replacement rules, known as phrase structure rules, for breaking a sentence down into its grammatical constituents. He further asserted that such finite sets of rules are capable of generating the infinitude of grammatical sentences in a particular language, and, perhaps more controversially, that the recursive nature of these rules produces a “digital infinity” in the human brain, forming the basis for the unbounded expressive capabilities of the human language faculty as a whole. [2]

An example of a simple phrase structure diagram in English, where S denotes a sentence composed of a noun phrase NP followed by a verb phrase VP. Words in the sentence are terminal nodes in the tree.
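To make phrase structure rules concrete, here is a toy grammar and parse sketched with the NLTK library; the grammar, rules, and sentence below are illustrative inventions, not Chomsky’s own formulation.

```python
# A toy phrase structure grammar in the spirit of Chomsky's replacement rules,
# sketched with NLTK's CFG utilities. Grammar and sentence are illustrative only.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Adj NP | N
VP -> V Adv
Det -> 'the'
Adj -> 'colorless' | 'green'
N -> 'ideas'
V -> 'sleep'
Adv -> 'furiously'
""")

parser = nltk.ChartParser(grammar)
sentence = "colorless green ideas sleep furiously".split()

# Print every parse tree the grammar licenses for the (grammatical but
# meaningless) sentence.
for tree in parser.parse(sentence):
    tree.pretty_print()
```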

Phrase structure grammars were a crucial tool used by early “rule-based” machine translation systems. Given the phrase structure grammar rules for the input and target languages, as well as a dictionary mapping between words in the two languages, such rule-based translation systems operated by mapping the grammatical structure of an input sentence onto the target language (using exactly these phrase structure rules), while making appropriate modifications to a translated word’s form according to its syntactic/semantic context. [5]
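As a rough sketch of how such a rule-based pipeline operates, consider a deliberately tiny “translator” consisting of a bilingual dictionary and a single reordering rule (English adjective-noun becomes French noun-adjective); the vocabulary and rule here are invented for illustration and bear no resemblance to a production RBMT system.

```python
# Toy rule-based "translation": a bilingual dictionary plus one syntactic
# reordering rule (English Adj+N -> French N+Adj). Purely illustrative.
DICTIONARY = {"the": "le", "red": "rouge", "cat": "chat", "sleeps": "dort"}
ADJECTIVES = {"red"}
NOUNS = {"cat"}

def translate(sentence: str) -> str:
    words = sentence.lower().split()
    # Apply the reordering rule before word-for-word substitution.
    reordered = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            reordered.extend([words[i + 1], words[i]])  # swap Adj and N
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    return " ".join(DICTIONARY.get(w, w) for w in reordered)

print(translate("The red cat sleeps"))  # -> "le chat rouge dort"
```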

However, specifying the phrase-structure grammars of various languages proved to be a highly labor-intensive task, and rule-based machine translation (RBMT) typically required enormous dictionaries and tediously hand-crafted linguistic rules. Moreover, RBMT systems generally produced translations that were comprehensible to native speakers but lacking in fluency.

The issues plaguing RBMT were partly resolved in the late 1980s by the advent of statistical machine translation, which utilizes large datasets of bilingual text to predict the most probable translation of a sentence. Though SMT initially translated input sequences word-by-word, phrase-based models translating larger chunks of linguistic input at a time quickly became the standard due to their superior performance. Interestingly, the “phrasemes” selected by an SMT system for translation do not generally coincide with linguistic phrases in the syntactic sense, and SMT models restricted to translating linguistically meaningful phrases have been shown to perform even worse at translation than their unrestricted counterparts. [5][9]
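A crude sketch of the statistical idea: candidate translations are scored by combining a phrase-translation probability with a target-language-model probability, and the highest-scoring candidate wins. The candidates and probabilities below are made up purely for illustration.

```python
# Toy "noisy channel" scoring for phrase-based SMT: pick the candidate e that
# maximizes P(f | e) * P(e). All probabilities are invented for illustration.
candidates = {
    # candidate output: (phrase translation prob, language model prob)
    "the house is small": (0.40, 0.30),
    "the home is small":  (0.35, 0.10),
    "small is the house": (0.45, 0.02),
}

def score(probs):
    translation_prob, language_model_prob = probs
    return translation_prob * language_model_prob

best = max(candidates, key=lambda e: score(candidates[e]))
print(best)  # -> "the house is small"
```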

Statistical machine translation models include very few hard-coded linguistic rules, yet they produce noticeably more fluent translations than rule-based systems. With both the availability of machine-readable (i.e., digitized) bilingual text corpora and computing power increasing exponentially around the turn of the century, statistical translation systems began to achieve commercial viability, with companies such as IBM, Microsoft, Google, and SYSTRAN (founded in the 1960s as one of the first machine translation companies) each offering their own statistical machine translation variant.

However, neither rule-based nor statistical machine translation appears to process linguistic input in a manner remotely similar to the way humans do. First, the “mental grammars” in the brains of human language speakers (the physical object of linguists’ study) encode an understanding of language that is abstract and unconscious: words are not physically definable, and we are not consciously aware of the grammatical rules we use to understand and produce language. [4] Rule-based translation systems appear to conflict with the latter principle, since in RBMT the grammar of a language is explicitly defined and directly consulted for comprehension and translation. Second, while statistical machine translation is in some ways analogous to the statistical language acquisition hypothesis (which seeks to explain aspects of human language acquisition as the learning of statistical distributions of linguistic units), the statistical distribution learned by an SMT system is of a different nature, defined over two languages rather than within a single one.

RBMT and SMT remained the two dominant approaches to machine translation until the early 2010s, when a new family of language models built on neural networks took the field by storm.

Neural Networks for NLP

Overview

Neural networks are a statistical and computational architecture loosely inspired by the biological neural networks found in the brain. Their utility stems from their remarkable ability to learn complex functions of input data from large datasets; the universal approximation theorem formalizes this capability, stating that a neural network with sufficiently many parameters can approximate any continuous function on a compact domain to arbitrary accuracy.

Structurally, neural networks are composed of layers of interconnected nodes that propagate signals forward through the network to produce an output as a function of some input. As an example, in feedforward neural networks (the “vanilla” variant), the value of a single node is calculated by applying some nonlinear function to a weighted sum of all nodes in the previous layer. The weights associated with each node form the parameters of the network, which may be varied to produce different outputs for a given input.

Figure 1: Simple feedforward neural network architecture
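As a minimal NumPy sketch of the forward pass just described (layer sizes, weights, and the choice of nonlinearity are arbitrary):

```python
# Forward pass of a tiny feedforward network: each layer applies a nonlinearity
# to a weighted sum (plus bias) of the previous layer's values.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Layer sizes: 4 inputs -> 8 hidden -> 3 outputs (arbitrary choices).
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

def forward(x):
    h = relu(W1 @ x + b1)   # hidden layer activations
    return W2 @ h + b2      # output layer (no nonlinearity here)

x = rng.normal(size=4)
print(forward(x))
```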

The key algorithms that allow neural networks to learn a function over some input domain are backpropagation and gradient descent. If we are able to define a differentiable error function on the output of the neural network (designed to tell the neural network how poorly it is doing at estimating the target function to be learned), backpropagation allows us to efficiently calculate the derivative of this error function with respect to the weights in each layer; this may be used, in turn, to update the value of each weight in the neural network in a way that will decrease the error function, assuming that the weight updates are sufficiently small. If this weight updating algorithm is then iterated millions of times over a dataset, the error function will slowly converge to some minimum in a process known as “gradient descent,” with the neural network parameters transforming to better represent the target function over time.

Illustration of gradient descent in one dimension. w represents the network weight, and J(w) represents the error function. Gradient descent is not guaranteed to converge to the global cost minimum.
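A one-dimensional sketch of this procedure, using an invented error function J(w) = (w - 3)^2 whose minimum sits at w = 3:

```python
# One-dimensional gradient descent on an illustrative error function
# J(w) = (w - 3)^2, whose minimum is at w = 3.
def grad_J(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

w = -5.0                     # arbitrary starting weight
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * grad_J(w)   # move against the gradient

print(w)  # converges close to 3.0
```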

RNNs & LSTMs

In 2014, neural machine translation (NMT) emerged as a new approach to statistical machine translation, leveraging neural networks to learn the statistical model that maps inputs from one language to another. As we will see, however, the applications of neural networks in NLP extend far beyond machine translation.

Two important breakthroughs were necessary to translate the problem of NLP into the domain of neural networks.

First, neural networks operate on vectors of numbers, so a meaningful mapping from words to vectors is a necessary starting point for a neural network to work with language. This was accomplished by word2vec, devised in 2013 by a team of Google researchers. Broadly speaking, word2vec refers to a collection of related neural network architectures for learning vectors known as “word embeddings” from large natural language datasets. The networks usually learn these word embeddings by one of two mechanisms: the continuous bag-of-words (CBOW) approach, in which the network must predict a word given the words in its immediate context, and the skip-gram approach, in which the network must predict the words within a given range before and after the current word. Both approaches give rise to word embeddings that encode meaningful semantic and syntactic content, which may be used downstream for other NLP tasks.
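As a sketch of what training such embeddings looks like in practice, here is a skip-gram model trained with the open-source gensim library; the toy corpus and hyperparameters are placeholders, and real embeddings require far larger datasets.

```python
# Training word2vec (skip-gram) embeddings with gensim on a toy corpus.
# Corpus and hyperparameters are placeholders; real training needs far more text.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sleeps", "on", "the", "mat"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the word embeddings
    window=2,         # context window size
    sg=1,             # 1 = skip-gram, 0 = continuous bag-of-words
    min_count=1,
)

print(model.wv["king"].shape)             # a 50-dimensional embedding
print(model.wv.most_similar("king", topn=2))
```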

Second, written language is a sequence of words, and early neural network architectures were not capable of processing sequences with complex temporal relationships as input. Recurrent Neural Networks (RNNs), in particular Long Short-Term Memory (LSTM) RNNs, were developed to accept and process such sequential input, and thus are well-suited for processing language, video, and other time-series data. In general, RNNs include a memory mechanism for storing the internal or “hidden” state of a sequence, which is fed as an input into a neural network along with the next element in the sequence to produce some output and a new hidden state. In the image below, s denotes the hidden state, x denotes the input, and o denotes the output, where nodes represent RNN “cells”.

Image credit: kdnuggets.com
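A minimal NumPy sketch of one recurrent step, using the same s, x, o notation (all dimensions and weights are arbitrary):

```python
# One step of a vanilla RNN: combine the previous hidden state s with the
# current input x to produce a new hidden state and an output o.
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim, output_dim = 16, 32, 10   # arbitrary sizes

W_xs = rng.normal(size=(hidden_dim, embedding_dim))  # input -> hidden
W_ss = rng.normal(size=(hidden_dim, hidden_dim))     # hidden -> hidden (recurrence)
W_so = rng.normal(size=(output_dim, hidden_dim))     # hidden -> output

def rnn_step(x, s_prev):
    s = np.tanh(W_xs @ x + W_ss @ s_prev)  # new hidden state
    o = W_so @ s                           # output at this time step
    return o, s

# Run the cell over a sequence of (random stand-in) word embeddings.
sequence = rng.normal(size=(5, embedding_dim))
s = np.zeros(hidden_dim)
for x in sequence:
    o, s = rnn_step(x, s)
```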

In the context of NLP, the inputs x are typically the embedding vectors of words in a sentence, and the hidden state at a given time step can be thought of as an intermediate vector representation of the sentence. Such recurrent architectures can be utilized for language modeling in several different ways:

a) Many-to-One: text classification, final-word prediction, masked-word prediction

b) One-to-Many: text generation from one word, sentence completion

c) Many-to-Many (Seq2Seq): translation, text summarization

Seq2Seq LSTM model for English-French translation, where each cell is an LSTM cell taking a word embedding and the previous hidden state as inputs.

The Neural Machine Translation system launched by Google in 2016 used a Seq2Seq RNN architecture similar to the one depicted above, achieving a stunning 60% reduction in translation errors over the previous Phrase-Based Machine Translation system.
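The sketch below shows the skeleton of such a Seq2Seq encoder-decoder built from LSTMs in PyTorch; vocabulary sizes and dimensions are placeholders, and a production system like Google’s adds attention, beam search, and a full training pipeline.

```python
# Skeleton of a Seq2Seq encoder-decoder with LSTMs in PyTorch. Sizes are
# placeholders; a real NMT system adds attention, beam search, and training.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        embedded = self.embed(src_tokens)   # (batch, seq_len, embed_dim)
        _, (h, c) = self.lstm(embedded)     # final hidden/cell states
        return h, c                         # summary of the source sentence

class Decoder(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_tokens, state):
        embedded = self.embed(tgt_tokens)
        output, state = self.lstm(embedded, state)
        return self.out(output), state      # scores over the target vocabulary

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, 10_000, (1, 7))      # dummy 7-token source sentence
tgt = torch.randint(0, 10_000, (1, 5))      # dummy 5-token target prefix
logits, _ = decoder(tgt, encoder(src))
print(logits.shape)                         # torch.Size([1, 5, 10000])
```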

Transformers

The field of NLP is rapidly evolving, however, and even the highly successful RNN/LSTM architectures were quickly displaced as the standard for NLP by a new neural network architecture known as the Transformer. First proposed in 2017 in the aptly titled paper “Attention Is All You Need,” Transformers differ from recurrent architectures in that sequential input need not be processed in order; instead, an entire sequence is fed at once to an attention mechanism, which assigns a weight to every word pair in the sequence in order to identify each word’s syntactic and semantic context. Stacked attention layers are essentially all that is needed to build a state-of-the-art Transformer model. This change resolved a major computational bottleneck in RNN/LSTM architectures, whose sequential nature precluded the parallelization necessary for training on enormous natural language datasets.
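A NumPy sketch of the scaled dot-product attention at the core of the Transformer, where the query, key, and value matrices stand in for learned projections of the word embeddings (sizes are arbitrary):

```python
# Scaled dot-product attention: each position attends to all positions via a
# softmax over pairwise query-key similarities. Q, K, V stand in for learned
# linear projections of the input word embeddings.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                  # arbitrary sizes
Q = K = V = rng.normal(size=(seq_len, d_model))   # self-attention on one "sentence"
output, weights = attention(Q, K, V)
print(weights.round(2))   # how strongly each word attends to every other word
```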

GPT (Generative Pre-trained Transformer), a multi-purpose language model created by OpenAI, and BERT (Bidirectional Encoder Representations from Transformers), developed at Google, are two canonical examples of Transformer-based architectures that use attention mechanisms to perform a variety of language modeling tasks. Both derive from the original Transformer’s encoder-decoder design: BERT stacks only the encoder layers and is trained to reconstruct masked-out portions of its input, while GPT stacks only the decoder layers and is trained to predict the next word in a sequence. Crucially, no explicit labels for parts of speech, syntax, etc., are needed to train such models to represent complex linguistic structures, paralleling the largely self-supervised nature of childhood language acquisition.

A high-level overview of the BERT architecture, trained to predict masked words in an English sentence.
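For a feel of what masked-word prediction looks like in practice, a pretrained BERT can be queried in a few lines with the Hugging Face transformers library (the model name and sentence are just one illustrative choice):

```python
# Masked-word prediction with a pretrained BERT via the Hugging Face
# `transformers` library. Model name and sentence are illustrative choices.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```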

The Linguistics of Transformers

The developments in neural NLP and NMT over the past decade have upended the long-held notion that statistical methods are insufficiently powerful to capture the nuances of linguistic structure. Indeed, a close examination of the word embeddings and attention weightings learned by the BERT architecture reveals sophisticated representations of syntax and semantics.

Teasing out the linguistic structures learned by Transformers requires some ingenuity, however. In their recent paper “Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-Supervision,” Manning et al. [7] proposed a method for constructing phrase-structure trees from the contextualized vector representations of a sentence’s constituent words, as illustrated below:

Surprisingly, the trees generated using this method coincide almost exactly with their human-annotated counterparts. In the example sentence below, the blue brackets represent the tree structure generated by BERT, while the black brackets represent the human-annotated tree structure.
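The probe in Manning et al. learns a linear transformation of the word vectors before measuring distances; the simplified sketch below captures only the final step, recovering a tree as the minimum spanning tree of a pairwise distance matrix (the vectors here are random stand-ins for real BERT embeddings).

```python
# Simplified sketch: recover a tree from pairwise distances between
# contextualized word vectors by taking a minimum spanning tree. The probe in
# Manning et al. additionally learns a linear map before computing distances.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

words = ["the", "chef", "who", "ran", "to", "the", "store", "was", "out"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 768))   # stand-ins for BERT word vectors

# Pairwise Euclidean distances between word vectors.
diffs = vectors[:, None, :] - vectors[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Edges of the minimum spanning tree approximate the sentence's tree structure.
tree = minimum_spanning_tree(distances).toarray()
for i, j in zip(*np.nonzero(tree)):
    print(f"{words[i]} -- {words[j]}")
```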

But aside from their performance on small language subtasks such as creating phrase structure trees or labeling parts of speech, how computationally similar are such artificial neural language models to the biological neural circuits responsible for the human language faculty?

Facebook AI researchers Caucheteux and King explored exactly this question in their paper “Language Processing in Brains and Deep Neural Networks: Computational Convergence and Its Limits” [1]. First, functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) were used to record the brain activity of 102 human subjects presented with words either in isolation or within a narrative. This brain activation data was then compared to the “activations” (i.e., the embeddings produced at each layer) of deep Transformer architectures previously trained on general language modeling tasks.

Their results confirmed earlier investigations, finding that in the absence of word context, word embeddings in the middle layers of Transformer networks correlated most strongly with fMRI and MEG signals in the fronto-temporo-parietal region, an area known to be involved in low-level linguistic processing; earlier and later layers showed little to no correlation with the brain data. Correlations were higher for the Transformer networks that achieved better accuracy on their respective language modeling tasks. Furthermore, the correlation between middle-layer word embeddings and brain activations increased when the words were contextualized for both the Transformer and the test subjects. Altogether, this provides compelling evidence for a meaningful correspondence between the two abstract linguistic representations: while state-of-the-art neural NLP models surely pale in complexity beside their biological counterparts, results from the neuroscience of language seem to empirically justify the notion of embedding words in a high-dimensional space to capture latent semantic and syntactic content. [1][3]
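As a rough illustration of the model-side quantity being compared against brain recordings, the Hugging Face transformers library can expose every layer’s hidden states for a given input; the model name below is an illustrative choice, and the actual study used its own models and alignment procedure.

```python
# Extracting per-layer "activations" (hidden states) from a pretrained
# Transformer, the model-side quantity compared against fMRI/MEG recordings.
# Model choice is illustrative; the study used its own models and alignment.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One tensor per layer (embeddings + 12 Transformer layers for BERT-base),
# each of shape (batch, sequence_length, hidden_size).
for layer, hidden in enumerate(outputs.hidden_states):
    print(layer, tuple(hidden.shape))
```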

Conclusion

As we have seen, the field of NLP has transformed rapidly over the past decade, moving away from intricate rule-based language models in favor of self-supervised neural network models capable of learning linguistic structures without explicit human guidance. Although the early algorithms used for machine translation, part-of-speech tagging, and related language-comprehension tasks had scant grounding in biology, intriguing analogies and correlations have been found between state-of-the-art neural models and activity in the brain’s linguistic processing centers. Perhaps in the future, artificial neural networks will provide linguists and neuroscientists with a powerful new means of studying the computational circuitry of the human language faculty.

References

  1. Caucheteux, Charlotte, and Jean-Remi King. “Language Processing in Brains and Deep Neural Networks: Computational Convergence and Its Limits,” 2020. https://doi.org/10.1101/2020.07.03.186288.
  2. “Chomsky’s Grammar.” Encyclopædia Britannica. Encyclopædia Britannica, inc. Accessed May 10, 2021. https://www.britannica.com/science/linguistics/Chomskys-grammar.
  3. Huth, Alexander G., Wendy A. de Heer, Thomas L. Griffiths, Frédéric E. Theunissen, and Jack L. Gallant. “Natural Speech Reveals the Semantic Maps That Tile Human Cerebral Cortex.” Nature 532, no. 7600 (2016): 453–58. https://doi.org/10.1038/nature17637.
  4. Jackendoff, Ray. Patterns in the Mind: Language and Human Nature. New York: BasicBooks, 2010.
  5. Lehrberger, John. Machine Translation: Linguistic Characteristics of MT Systems and General Methodology of Evaluation. Amsterdam: J. Benjamins Publ. Co., 1988.
  6. Lewis-Kraus, Gideon. “The Great A.I. Awakening.” The New York Times Magazine, December 14, 2016. https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html.
  7. Manning, Christopher D., Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. “Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-Supervision.” Proceedings of the National Academy of Sciences 117, no. 48 (2020): 30046–54. https://doi.org/10.1073/pnas.1907367117.
  8. Ruder, Sebastian. “A Review of the Recent History of Natural Language Processing.” Sebastian Ruder, January 4, 2021. https://ruder.io/a-review-of-the-recent-history-of-nlp/.
  9. “What Is Machine Translation? Rule Based vs. Statistical Machine Translation.” SYSTRAN. Accessed May 10, 2021. https://www.systransoft.com/systran/translation-technology/what-is-machine-translation/.
