Understanding perplexity and its relation to cross-entropy and compression
Modern language models, conditional or unconditional, often report perplexity as a validation metric.
Let’s see what it means intuitively and how it connects to other important quantities from information theory, such as cross-entropy and compression.
Definition of perplexity
The perplexity of a sequence of observations is defined as:
$\displaystyle \mathcal{P} = P(x_1, x_2, … , x_N) ^ {- \frac{1}{N}}$
Mathematically, it is the reciprocal of the probability that this sequence appears in natural language, normalised by taking the $N$-th root so that the length of the sequence does not push the probability towards zero as we add more terms (and consequently push the perplexity towards infinity).
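To make the formula concrete, here is a minimal sketch in Python; the sequence probability and length are hypothetical numbers, not the output of any real model.

```python
import math

# Hypothetical values, just to exercise the definition above.
sequence_probability = 1e-12   # P(x_1, ..., x_N), assumed given
N = 10                         # number of words in the sequence

perplexity = sequence_probability ** (-1.0 / N)
print(perplexity)  # ~15.85: the N-th root keeps the value from exploding as N grows
```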
Background on language models
Language models are trained to model the probability distribution of the next word in a sequence given the previous ones:
$P(x_N | x_{N-1}, …, x_1)$
We call such models auto-regressive. These models are elegant for at least two reasons. The first is that we can easily sample from them to produce text (English prose, code, cooking recipes, etc.).
The second is that their formulation corresponds to a specific decomposition of the joint probability of the sequence, which follows directly from the chain rule of probability, and always holds true:
$\displaystyle P(x_1, x_2, … , x_N) = \prod_{i=1}^{N} P(x_i | x_{i-1}, …, x_1)$
To train these models we use the standard cross-entropy loss, written as:
$\displaystyle \mathcal{C} = - \frac{1}{N} \sum_{i=1}^{N} \log P(x_i | x_{i-1}, …, x_1)$
We can identify this as $-\frac{1}{N}$ times the $\log$ of the joint probability of the sequence, thanks to the chain-rule decomposition above. Elegant!
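As a small illustration, here is the same objective computed in plain Python; the per-token probabilities are made-up values standing in for what a trained model would assign to the observed words.

```python
import math

# Hypothetical conditional probabilities P(x_i | x_{i-1}, ..., x_1)
# assigned by some model to the observed words.
token_probs = [0.2, 0.5, 0.1, 0.4]

N = len(token_probs)
cross_entropy = -sum(math.log(p) for p in token_probs) / N
print(cross_entropy)  # ~1.38 nats per word (natural log)
```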
Connecting perplexity to cross-entropy
As mentioned above, language models (conditional or not) are typically trained with cross-entropy. Let’s see how perplexity is connected to this cross-entropy.
First, recall the definition of perplexity:
$\displaystyle \mathcal{P} = P(x_1, x_2, … , x_N) ^ {- \frac{1}{N}}$
Now, let’s take the log of the perplexity:
$\displaystyle \log \mathcal{P} = \log P(x_1, x_2, … , x_N) ^ {- \frac{1}{N}} = - \frac{1}{N} \log P(x_1, x_2, … , x_N)$
Next we can decompose the joint probability of the sequence with the chain rule of probability, the same decomposition used in auto-regressive models:
$\displaystyle \log \mathcal{P} = - \frac{1}{N} \log \prod_{i=1}^{N} P(x_i | x_{i-1}, …, x_1) = - \frac{1}{N} \sum_{i=1}^{N} \log P(x_i | x_{i-1}, …, x_1)$
We recognize here the auto-regressive cross-entropy objective $\mathcal{C}$. Hence $\log \mathcal{P} = \mathcal{C}$, and the perplexity is the exponential of the cross-entropy of the language model: $\mathcal{P} = e^{\mathcal{C}}$.
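We can check this equivalence numerically with the same toy probabilities as before; again these are hypothetical values, the point is only that both routes give the same number.

```python
import math

# Hypothetical per-word conditional probabilities, as before.
token_probs = [0.2, 0.5, 0.1, 0.4]
N = len(token_probs)

# Route 1: exponentiate the average negative log-likelihood (the cross-entropy).
cross_entropy = -sum(math.log(p) for p in token_probs) / N
perplexity_from_ce = math.exp(cross_entropy)

# Route 2: apply the definition directly to the joint probability (chain rule).
joint_prob = math.prod(token_probs)
perplexity_direct = joint_prob ** (-1.0 / N)

print(perplexity_from_ce, perplexity_direct)  # both ~3.98
```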
Ok, but what does it mean?
In terms of distance
The cross-entropy of a given sequence is a measure of how far this sequence is from the probability distribution modelled by the language model (*).
The bigger the cross-entropy, the less likely the language model would have been to generate the sequence being tested.
Since the exponential is monotonically increasing, the same holds for the perplexity: the higher the perplexity, the worse the language model is at modelling the sequence being evaluated.
Note that this does not directly judge the quality of the sentences being generated by the language model. This is a very important point.
The language model could generate perfectly correct sentences, syntactically sound and meaningful, and still get a very high perplexity. One extreme example is a model that only ever generates a single sentence: its perplexity would be infinite on any other sentence.
(*) In truth, the cross-entropy measures the dissimilarity between two probability distributions (it is not a proper distance), but we take this shortcut here for simplicity.
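To make the extreme example above concrete, here is a minimal sketch (with made-up, near-zero probabilities) showing that a model which all but refuses to generate a perfectly valid sentence gets a huge perplexity on it:

```python
import math

# Hypothetical near-zero probabilities a degenerate model assigns
# to the words of a perfectly valid sentence it never generates.
token_probs = [1e-9, 1e-9, 1e-9]

N = len(token_probs)
perplexity = math.prod(token_probs) ** (-1.0 / N)
print(perplexity)  # 1e9: huge perplexity, yet it says nothing about the sentence's quality
```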
In terms of choice
Say there is only a single word in our language. The perplexity of the sequence containing this single word would be 1.
$\displaystyle \mathcal{P} = P(x_1) ^ {- \frac{1}{1}} = \frac{1}{P(x_1)} = \frac{1}{1.0} = 1$
If the vocabulary had two equally probable words, the perplexity of a sequence containing one of those words would be 2.
$\displaystyle \mathcal{P} = P(x_1) ^ {- \frac{1}{1}} = \frac{1}{P(x_1)} = \frac{1}{0.5} = 2$
Intuitively, the perplexity of a sentence corresponds to the average number of samples the language model would have to draw per word in order to generate this sentence. Said differently, it is the average number of choices the language model effectively faced at each word when generating this sentence.
The higher this number, the less likely the language model would have been to generate the given sentence. Again, note the inversion: we are not judging the quality of the text generated by the language model so much as the generality of the language model.
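A quick numerical check of this “number of choices” reading, using a hypothetical uniform model over $V$ equally likely words: the perplexity of any sequence is exactly $V$.

```python
import math

V = 2                         # hypothetical vocabulary size
token_probs = [1.0 / V] * 5   # every observed word gets probability 1/V

N = len(token_probs)
perplexity = math.prod(token_probs) ** (-1.0 / N)
print(perplexity)  # 2.0: the model effectively faces V choices at each word
```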
In terms of information content and compression
The cross-entropy of a given sequence is a measure of how many bits per word are needed to encode this sentence under the probability distribution defined by the language model (bits if we take base-2 logarithms, nats if we take natural logarithms).
The higher the cross-entropy (equivalently, the higher the perplexity), the less the sentence can be compressed by the language model.
In this sense, perplexity and cross-entropy are measures of the compressibility of natural language text under the probability distribution defined by the language model.
A perfect language model would be able to compress natural language with the smallest possible number of bits, thanks to its perfect modelling of the joint probability distribution of language.
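As a final sketch, converting the (natural-log) cross-entropy into bits per word shows the compression reading directly; the probabilities are the same hypothetical values as above, and an entropy coder such as an arithmetic coder could in principle approach this rate.

```python
import math

# Hypothetical model probabilities of the observed words, as before.
token_probs = [0.2, 0.5, 0.1, 0.4]

N = len(token_probs)
cross_entropy_nats = -sum(math.log(p) for p in token_probs) / N
bits_per_word = cross_entropy_nats / math.log(2)  # convert nats to bits
print(bits_per_word)  # ~1.99 bits/word; a better model would need fewer bits
```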