Auto-regressive language models don’t necessarily sample the most probable sentences
The year 2020 saw the coming of GPT-3. Since then, many increasingly powerful language and language-image models (such as Chinchilla or Flamingo) have been released.
What these models have in common is the auto-regressive nature of their sequence generation process. Each word is generated by conditioning on the previous words.
Although very powerful, these models, even when sampled in a greedy fashion (i.e. always picking the most probable word as the next token), will not necessarily generate the most probable sentence or answer. Let’s see why.
Background on language models
Language models are trained to model the probability distribution of the next word in the sentence given the previous ones:
$P(x_N | x_{N-1}, \dots, x_1)$
We call such models auto-regressive. These models are elegant for at least two reasons. The first is that we can easily sample from them to produce a sentence (write English text, code, cooking recipes, etc.).
The second is that their formulation corresponds to a specific decomposition of the joint probability of the sequence, which follows directly from the chain rule of probability, and always holds true:
$P(x_1, x_2, \dots, x_N) = \prod_{i=1}^{N} P(x_i | x_{i-1}, \dots, x_1)$
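This is what makes sampling so easy: generate one token at a time from the conditional distribution (so-called ancestral sampling). Here is a minimal sketch of that loop; the `model` callable and the token names are hypothetical placeholders, not a real API:

```python
import random

def sample_sentence(model, prompt, eos="<eos>", max_len=50):
    """Ancestral sampling: draw each token from P(x_i | x_{i-1}, ..., x_1).

    `model` is assumed to map a list of tokens so far to a
    {token: probability} dict over the next token.
    """
    tokens = list(prompt)
    while len(tokens) < max_len:
        probs = model(tokens)
        next_token = random.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens
```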
Greedy does not lead to most probable
Say our language model manages to perfectly match the conditional distribution of the next word in natural language:
$P(x_N | x_{N-1}, \dots, x_1)$
We now decide to generate the next sentence by repeatedly picking the most probable next token from this distribution until the end-of-sequence token is reached. Why are we not guaranteed to generate the most probable sentence?
Let’s start with an example
In the example below, the prefix represents what our auto-regressive model has generated so far, and the table gives the possible next tokens with their associated probabilities:
| Prefix | Next token | Probability |
|---|---|---|
| A | B | 60% |
| A | C | 40% |
| AB | A | 33% |
| AB | B | 33% |
| AB | C | 33% |
| AC | C | 100% |
Say we are given the prefix “A” (which, for example, represents the question asked of our language model) and we want to generate 2 tokens.
With greedy decoding, our model will select “B” as the next token (probability 60%). It is then cornered and has to pick a token with probability 33%. The overall probability of the answer is 60% × 33% ≈ 20%.
But the most probable continuation in that case is “CC” (first token with probability 40%, then second with probability 100%), leading to an overall probability of 40%, twice what the greedy algorithm achieves.
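We can check this with a few lines of code on the toy distribution above, brute-forcing all length-2 continuations:

```python
import itertools

# The toy conditional distribution from the table above.
P = {
    ("A",): {"B": 0.6, "C": 0.4},
    ("A", "B"): {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3},
    ("A", "C"): {"C": 1.0},
}

def greedy(prefix, n_tokens):
    """Pick the most probable next token at each step."""
    tokens, prob = list(prefix), 1.0
    for _ in range(n_tokens):
        probs = P[tuple(tokens)]
        best = max(probs, key=probs.get)
        prob *= probs[best]
        tokens.append(best)
    return "".join(tokens), prob

def most_probable(prefix, n_tokens):
    """Brute-force search over every possible continuation."""
    best_tokens, best_prob = None, 0.0
    for continuation in itertools.product("ABC", repeat=n_tokens):
        tokens, prob = list(prefix), 1.0
        for tok in continuation:
            probs = P.get(tuple(tokens), {})
            if tok not in probs:
                prob = 0.0
                break
            prob *= probs[tok]
            tokens.append(tok)
        if prob > best_prob:
            best_tokens, best_prob = "".join(tokens), prob
    return best_tokens, best_prob

print(greedy(["A"], 2))         # ('ABA', ~0.2): greedy is stuck at 20%
print(most_probable(["A"], 2))  # ('ACC', 0.4): the true optimum
```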
But why?
As discussed above, the probability of a sentence of $N$ tokens can be decomposed using the chain rule of probability like so:
$P(x_1, x_2, \dots, x_N) = \prod_{i=1}^{N} P(x_i | x_{i-1}, \dots, x_1)$
In the general case, maximizing this quantity cannot be done by successively maximizing it with respect to each variable in isolation (even by selecting a “clever” ordering of the variables with respect to which to maximize).
This is one of the reasons why it is so important not to sample in a greedy manner. But to me, there is another, more intuitive reason as well…
Greedy is boring
Greedy sampling, by always selecting the most probable next word, is likely to pick the most boring one.
Remember from information theory that the information content $\mathcal{I}$ of an event $X$ is the logarithm of the reciprocal of the probability:
$\mathcal{I}(X) = - \log P(X)$
A very probable event conveys very little information when it occurs. Telling you that the sun will rise and set tomorrow is quite boring. Telling you that a comet will appear in the sky tomorrow is much more interesting. Adding the exact time to the sentence adds even more information and makes it even more interesting.
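To make this concrete, here is a tiny computation of information content, using a base-2 logarithm so the result reads in bits (the probabilities are made up for illustration):

```python
import math

def information(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

print(information(0.999))   # the sun rises tomorrow: ~0.001 bits, boring
print(information(0.0001))  # a comet appears tomorrow: ~13.3 bits, interesting
```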
In short, if you want an interesting conversation, the minimum requirements are surprise and information content. These are reasons why greedy decoding should be avoided.
So should we instead generate the next word with the smallest probability? That is obviously not a good idea either: if the next word has a very low probability, appending it to the sentence is likely to be grammatically incorrect.
The solution used in most papers is to sample from the distribution using a temperature in the softmax, so that a reasonable diversity is achieved without endangering the syntactic correctness of the sentence.
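Here is a minimal sketch of that idea, assuming the model exposes raw logits for the next token (the function name and defaults are hypothetical):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=None):
    """Sample a token index from logits scaled by a temperature.

    T -> 0 approaches greedy decoding, T = 1 keeps the model's
    distribution, T > 1 flattens it toward uniform.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax
    return int(rng.choice(len(probs), p=probs))
```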
Links to similar phenomena
For those familiar with Hidden Markov Models (HMMs), this phenomenon is very reminiscent of the difference between “filtering” and the “Viterbi” algorithm.
Given a sequence of observations, the filtering algorithm gives you the most probable current state. But repeatedly “filtering” at every position in the sequence will not give you the most probable succession of states; for that, you must use the Viterbi algorithm.
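Here is a small sketch contrasting the two on a toy 2-state HMM (the parameters and observations are made up for illustration):

```python
import numpy as np

# A tiny 2-state HMM with illustrative, made-up parameters.
init = np.array([0.5, 0.5])               # P(z_1)
trans = np.array([[0.9, 0.1],             # P(z_t = j | z_{t-1} = i)
                  [0.1, 0.9]])
emit = np.array([[0.7, 0.3],              # P(obs = o | z = i)
                 [0.4, 0.6]])
obs = [0, 1, 1]

# Filtering (forward algorithm): argmax of P(z_t | obs_1..t) at each step.
alpha = init * emit[:, obs[0]]
filtered = [int(np.argmax(alpha))]
for o in obs[1:]:
    alpha = (alpha @ trans) * emit[:, o]
    filtered.append(int(np.argmax(alpha)))

# Viterbi: the single most probable state *sequence* (in log space).
delta = np.log(init * emit[:, obs[0]])
backpointers = []
for o in obs[1:]:
    scores = delta[:, None] + np.log(trans)  # scores[i, j]: from state i to j
    backpointers.append(np.argmax(scores, axis=0))
    delta = scores.max(axis=0) + np.log(emit[:, o])
path = [int(np.argmax(delta))]
for bp in reversed(backpointers):
    path.append(int(bp[path[-1]]))
path.reverse()

print(filtered, path)  # [0, 1, 1] vs [1, 1, 1]
```

With these numbers, the per-step argmax of the filtered posterior is `[0, 1, 1]` while the Viterbi path is `[1, 1, 1]`: exactly the same gap as between greedy decoding and the most probable sentence.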