This article is simply my notes on the presentation “A Formal Perspective On Language Models,” given by Ryan Cotterell, assistant professor in the Department of Computer Science at ETH Zürich, at the National Science Foundation event Workshop on LLMs, Language Structure, and the Cognitive and Neural Basis of Language on May 13, 2024.

In this small exercise we aim to explore the limits of Large Language Models (LLMs) from the intersecting perspectives of linguistics and computational theory.

## The basics of the theory of computation

Let’s start from the basics—the most fundamental definition in the theory of computation, which is the definition of an alphabet, often denoted as \(\Sigma\).

The theory of computation is built entirely on symbols, generally letters and digits. An alphabet is defined as a finite set of symbols. For example, \(\Sigma = \{0, 1\}\) is the alphabet of binary digits.

An alphabet is a finite, non-empty set. This is a straightforward concept.

Now we need to address a concept that isn’t typically part of linguistics, though it is a significant step towards it: Kleene closure. Kleene closure is the set of all strings of finite length made up of elements from a given set (in our case, an alphabet). (The term “closure” refers to a set being “closed” under an operation. For strings, this means that concatenating any strings in the set results in a string that remains within the set.)

So, the Kleene closure is usually denoted as \(\Sigma^*\), and it represents all possible finite concatenations of the alphabet’s symbols.

The Kleene closure of an alphabet \(\Sigma\) is defined as the infinite union of all strings of length \(n\) over \(\Sigma\), where \(n \geq 0\) – starting from zero (the empty string). Formally, it is expressed as: \[ \Sigma^* = \bigcup_{n=0}^{\infty} \Sigma^n \]

Here, \(\Sigma^n\) denotes the set of all strings of length \(n\) that can be formed using the symbols from \(\Sigma\).
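The definitions above are easy to make concrete. Below is a minimal sketch in Python (the alphabet, function names, and the length cutoff are illustrative choices, not anything from the talk): \(\Sigma^n\) is enumerated directly, and since \(\Sigma^*\) is infinite, we can only materialize a finite prefix of the union up to some maximum length.

```python
from itertools import product

# A toy binary alphabet (hypothetical example).
SIGMA = ("0", "1")

def strings_of_length(sigma, n):
    """Sigma^n: all strings of exactly length n over sigma."""
    return {"".join(t) for t in product(sigma, repeat=n)}

def kleene_closure_up_to(sigma, max_len):
    """A finite prefix of Sigma^*: the union of Sigma^n for n = 0..max_len.
    (Sigma^* itself is infinite, so we can only enumerate a prefix.)"""
    result = set()
    for n in range(max_len + 1):
        result |= strings_of_length(sigma, n)
    return result

print(strings_of_length(SIGMA, 2))          # {'00', '01', '10', '11'}
print(len(kleene_closure_up_to(SIGMA, 3)))  # 1 + 2 + 4 + 8 = 15
```

Note that `strings_of_length(SIGMA, 0)` returns `{""}`, the set containing only the empty string, which is why the union starts at \(n = 0\).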

An interesting fact, obvious to a mathematician but less so to a data scientist, is that this infinite set is composed exclusively of elements of finite length.

This fact will play a significant role later on.

## Language

Now, let’s define a language. Language is a fundamental concept in the theory of computation. Languages are used to study and describe the behavior of computational models like finite automata, context-free grammars, and Turing machines.

In these terms, a language “over an alphabet \(\Sigma\)” is defined as a subset of the set of all possible strings (including the empty string) that can be formed using the symbols from \(\Sigma\). So, a language is a subset of \(\Sigma^*\): \[ L \subseteq \Sigma^* \]

A language can be finite or infinite, depending on the number of strings it contains. It can also be defined by specific rules or patterns, such as regular expressions, context-free grammars, or other formal systems.

These three definitions, especially the last one, provide you with everything you need to know about the theory of computation at a high level. The theory of computation is about finding devices that determine whether or not a string belongs to a language.

All the famous devices, such as finite state automata, context-free grammars, and Turing machines, are designed to determine whether a string belongs to a language.
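The idea of a device deciding membership can be sketched in a few lines. Below is a minimal, hypothetical example (not from the talk): a two-state finite automaton that decides the regular language of binary strings containing an even number of 1s.

```python
# A minimal finite-state automaton deciding membership in the language
# L = { w in {0,1}* : w contains an even number of 1s }.
def dfa_accepts(w: str) -> bool:
    state = 0                  # state 0 = "even number of 1s so far" (accepting)
    for ch in w:
        assert ch in "01"      # only symbols from Sigma = {0, 1} are allowed
        if ch == "1":
            state = 1 - state  # flip parity on every 1
    return state == 0

print(dfa_accepts(""))      # True  (the empty string has zero 1s)
print(dfa_accepts("0110"))  # True  (two 1s)
print(dfa_accepts("10"))    # False (one 1)
```

The automaton answers a yes/no question for each string, which is exactly the sense in which these devices “determine whether a string belongs to a language.”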

## Language model

But let’s get right into language models. Nowadays, many people are interested in proving the formal properties of language models, exploring what these models can or cannot do. For example, there is interest in predicting how soon AGI (Artificial General Intelligence) might appear, perhaps with the advent of GPT-7 or something like that.

If you want to prove a theorem about a mathematical object, you have to start with a definition. So a precise definition of what a language model is—one that is also historically accurate—is essential.

So, the question now is: What is a language model? This is going to be crucially important.

A language model is a probability distribution over \(\Sigma^*\) for some alphabet \(\Sigma\).

This is a complete generalization of what we did with formal language theory, because you can always recover a formal language by looking at the support.

In this context, “support” refers to the set of all strings in \(\Sigma^*\) to which the language model assigns non-zero probability.

Formally, if \(m\) is a language model over \(\Sigma^*\), the support of \(m\), denoted as \(\text{supp}(m)\), is defined as:

\[ \text{supp}(m) = \{ w \in \Sigma^* \mid m(w) > 0 \} \]

This means \(\text{supp}(m)\) includes all the strings \(w\) that have a non-zero probability under the language model \(m\).
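A toy example may make this concrete. The model below is a hypothetical illustration (mine, not from the talk): over the one-letter alphabet \(\Sigma = \{a\}\), it assigns \(m(a^n) = (1/2)^{n+1}\), which sums to 1 over all finite lengths \(n \geq 0\), so it is a valid distribution over \(\Sigma^*\).

```python
# A toy language model over Sigma* for Sigma = {"a"}:
# m("a" * n) = (1/2) ** (n + 1), a geometric distribution over string length.
def m(w: str) -> float:
    if set(w) <= {"a"}:            # strings made only of "a" get positive mass
        return 0.5 ** (len(w) + 1)
    return 0.0                     # everything else is outside the support

# The probabilities sum to 1 over finite strings (up to floating-point error):
total = sum(m("a" * n) for n in range(60))
print(abs(total - 1.0) < 1e-12)  # True

# supp(m) = { "a" * n : n >= 0 }:
print(m("aaa") > 0)  # True:  "aaa" is in supp(m)
print(m("ab") > 0)   # False: "ab" is not in supp(m)
```

Taking the support of this model recovers a formal language—here, the regular language \(\{a^n \mid n \geq 0\}\)—which is the generalization the definition above refers to.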

## Generator always halts

Now, the obvious consequence is this: if I define a language model as a distribution over \(\Sigma^*\), and \(\Sigma^*\) contains only finite-length strings, then any string sampled from the model has finite length with probability one—which means generation always halts.
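The halting argument can be simulated with a hypothetical generator (my own sketch, not from the talk): over \(\Sigma = \{a\}\), at each step it emits `"a"` with probability 1/2 or stops with probability 1/2. Because the per-step stopping probability is bounded away from zero, every sampling run terminates, and every sample is a finite string in \(\Sigma^*\).

```python
import random

# A toy generator over Sigma = {"a"} with per-step stopping probability 1/2.
# The loop exits with probability one, so every sample is a finite string.
def sample_string(rng: random.Random) -> str:
    chars = []
    while rng.random() < 0.5:  # continue with probability 1/2, else halt (EOS)
        chars.append("a")
    return "".join(chars)

rng = random.Random(0)
samples = [sample_string(rng) for _ in range(10_000)]
# Every run halted and produced a finite string over Sigma:
print(all(set(s) <= {"a"} for s in samples))  # True
```

Real neural language models generate the same way: an end-of-sequence token plays the role of the stopping event, so a well-defined model over \(\Sigma^*\) halts with probability one.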

This means that **no language model is Turing complete**.

## No language model is a general-purpose reasoner

This is a very obvious fact if you think about it; indeed, it follows directly from the definition of a language model above. If no language model is Turing complete, then no language model is a general-purpose reasoner. That’s just a fact now for you, I hope.

Turing completeness fundamentally relies on the possibility that the model does not halt. That’s the only way you can possibly get a Turing machine: the process might have to run forever. But a non-halting run would correspond to an infinite string, which lies outside \(\Sigma^*\) and is therefore impossible for a language model to produce.

So what does this mean? Continuing the chain of reasoning: every big transformer, no matter how large, halts with probability one, which means it can’t emulate a Turing machine. Consequently, it can’t perform general-purpose reasoning. Of course, this conclusion depends on the assumptions you make.

This means that not only are Transformers not Turing complete, but they’re also much farther down the Chomsky hierarchy.

That’s how their non-Turing-complete nature prevents these models from meeting the desiderata of AGI, which is often touted as requiring immense reasoning power.