Grünwald 2004 (part 3)

Preliminaries

Codes

  • a code for a set of objects \(\mathcal{X}\) assigns each object \(x \in \mathcal{X}\) a unique binary string of length at least one
  • Given a code \(C\), \({C}(x)\) refers to the binary string encoding the object \(x\)
  • As MDL is concerned with the length of things, it is helpful to have a concise notation for that: \(L_{{C}}(x)\) refers to the length of the binary string coding \(x\). In symbols: \(L_{{C}}(x) = \left| {C}(x)\right|\).
  • as each object has a unique code word, you can ask, for a given binary string, which object it encodes. For \(z\) a binary string, we write \({C}^{-1}(z)\) to refer to the object \(x\) (if any) that \(z\) is an encoding of.
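To make the notation concrete, here is a minimal Python sketch (the objects and code words are invented for illustration): a code is just a mapping from objects to binary strings, \(L_C\) is the length of an object's code word, and \(C^{-1}\) is lookup in the reversed mapping.

```python
# A toy code C for three objects; the code words are chosen arbitrarily.
C = {'a': '0', 'b': '10', 'c': '11'}

def L(C, x):
    """Code length L_C(x): the number of bits in the code word for x."""
    return len(C[x])

def decode(C, z):
    """C^{-1}(z): the object (if any) whose code word is the binary string z."""
    inverse = {code: obj for obj, code in C.items()}
    return inverse.get(z)  # None if z is not a code word

print(C['b'], L(C, 'b'), decode(C, '10'))  # -> 10 2 b
```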

Probability

  • a probability distribution \({P}\) is easiest to conceptualize over finite sets of objects. It specifies for each element of the set, its probability \(P(x)\), which is a number between 0 and 1 indicating how likely that object is to be chosen at random from its set.
  • If \(P(x) = .75\), then \(x\) will be chosen 75% of the time. As you can’t have more than 100% of something (except hard work), if you add up all of the probabilities assigned to the elements of the set, it must come out equal to 1 (= 100%); in symbols \(\Sigma_{x \in \mathcal{X}} P(x) = 1\)
  • If the probabilities sum to less than 1, the probability distribution is called defective.
  • If you have a probability distribution over a set of pairs \(\mathcal{X}\times\mathcal{Y}\), you assign probabilities to pairs of objects at a time: \(P(x,y)\). You can obtain from this a probability distribution over just the elements of \(\mathcal{X}\) by a process called marginalization, which ‘sums up’ the probabilities of all the pairs that an object \(x \in \mathcal{X}\) occurs in: \[P_{\mathcal{X}}(x) = \Sigma_{y\in \mathcal{Y}}P(x,y) = P(x,y_{1}) + P(x,y_{2}) + \ldots\]
  • A probabilistic source \(P^{\infty} = \langle P^{1},P^{2},\ldots\rangle\) is a family of probability distributions over finite sequences, with \(P^{i}\) being a distribution over sequences of length \(i\). The distributions in this sequence are required to be compatible with one another, in the sense that each is the marginalization of the one which follows it: \(P^{i}(x_{1},\ldots,x_{i}) = \Sigma_{x \in \mathcal{X}}P^{i+1}(x_{1},\ldots,x_{i},x)\)
  • A probabilistic source can be imagined to be a distribution over infinite sequences. And a distribution over infinite sequences can be imagined to be a static description of a process of choosing an object at random without ever stopping. In this case, a finite sequence reflects one’s choices at a particular point in time. One can imagine later choices being influenced by earlier choices. If this is not the case, the choices are said to be independently and identically distributed (iid). If the data are iid, then the probability of any finite sequence of choices is the product of each choice’s independent probability: \[P(x_{1},\ldots,x_{n}) = \Pi_{i=1}^{n}P(x_{i}) = P(x_{1})\times \dots \times P(x_{n})\]
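The two operations above (marginalization and the iid product) are easy to spell out in code. A small Python sketch, with a made-up joint distribution and a made-up coin distribution purely for illustration:

```python
# A made-up joint distribution over pairs from X x Y.
P_joint = {('x1', 'y1'): 0.3, ('x1', 'y2'): 0.2,
           ('x2', 'y1'): 0.1, ('x2', 'y2'): 0.4}

def marginal_X(P_joint, x):
    """P_X(x) = sum over y of P(x, y): marginalize the second coordinate away."""
    return sum(p for (xi, _), p in P_joint.items() if xi == x)

print(marginal_X(P_joint, 'x1'))  # 0.3 + 0.2 = 0.5

# For iid data, the probability of a finite sequence is the product of the
# probabilities of its elements: P(x_1, ..., x_n) = prod_i P(x_i).
P_coin = {'heads': 0.5, 'tails': 0.5}

def prob_iid(P, sequence):
    result = 1.0
    for x in sequence:
        result *= P[x]
    return result

print(prob_iid(P_coin, ['heads', 'heads', 'tails']))  # 0.5 ** 3 = 0.125
```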

Prefix Codes

It is useful to have codes for objects that can be combined in compositional ways. Since we are interested in sequences of objects, it would be nice if the code for a sequence were compositionally related to the codes for the elements making up the sequence.

One way we could do this would be to encode commas (i.e. separators) so that we could keep track of when the code for one object ended, and another object began. However, the code for commas needs to be chosen so that it does not overlap at all with the codes used for objects, otherwise, we might read a code, and not know whether that part in the middle is a comma, or just part of the code for a single object.

If all codes for objects had the same length, we could just leave the commas out altogether, as we would still be able to figure out where the code for one object ended and the next began. (I’m here imagining that the code for a sequence of objects is just the concatenation of the codes for the individual objects.) However, this only works if we have a finite number of objects.

The crucial part of this scheme is not in fact the sameness of length, but rather that no code is a prefix of any other code. If we allow the lengths of code words to vary, but require that no code word is a prefix (an initial segment) of any other, we can still figure out where the codeword for one object ends and the next one begins.

All codes used in this tutorial are assumed to have this property.

A prefix code is complete just in case no other prefix code assigns a shorter code word to some object, while not increasing the sizes of the other objects' code words. In other words, a complete prefix code is one which is optimal in terms of the lengths of the code words it uses.
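To see the prefix property doing its work, here is a small Python sketch of decoding a concatenation of code words (the toy code is invented for illustration): because no code word is a prefix of any other, the first match found while scanning left to right is the only possible one.

```python
def decode_sequence(C, bits):
    """Split a concatenation of code words from the prefix code C back into
    objects, scanning left to right and emitting an object as soon as the
    bits read so far form a code word."""
    inverse = {code: obj for obj, code in C.items()}
    objects, current = [], ''
    for bit in bits:
        current += bit
        if current in inverse:
            objects.append(inverse[current])
            current = ''
    if current:
        raise ValueError('trailing bits do not form a complete code word')
    return objects

# A toy prefix code: no code word is an initial segment of any other.
C = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(decode_sequence(C, '0' + '110' + '10'))  # -> ['a', 'c', 'b']
```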

The Kraft inequality

Note that for each length \(l\), there are \(2^{l}\) binary strings of that length. (There are 2 possibilities for each position, and \(l\) positions.)

| length | strings | \(2^{-\mathrm{length}}\) |
| --- | --- | --- |
| 1 | 0, 1 | 1/2 |
| 2 | 00, 01, 10, 11 | 1/4 |
| 3 | 000, 001, 010, 011, 100, 101, 110, 111 | 1/8 |

Thus, a single word of length \(l\) is just \(\frac{1}{2^{l}} = 2^{-l}\) of all the words of that length. We can think of the number \(2^{-l}\) as representing the fraction of words of length \(l\) that a single word of that length takes up.

Note that if you have 14 objects in toto, you can encode two of them with strings of length 1, four more with strings of length 2, and then the rest with strings of length 3. However, this will not be a prefix code; the code word 0 is a prefix of 6 others. If we want to have a prefix code (which we do), then if we want to use a code word of length 1, we cannot use any longer code words which begin with that same letter. Thus, if we want to encode some object with the word 0, we cannot encode another object with the word 00.

Thus, we can think of short code words as being expensive - in order to buy them, we have to ‘pay’ with a lot of other code words (that we can then use no more). If we have 8 objects that we must encode, we can buy a short code word for one of them, at the cost of longer code words for others: if object a is encoded as the short code word 0, then there are only two words of length 2 and four words of length 3 left over - six all together - which means that one object of the eight needs to be given a code of length 4 or longer. Of course, if we use a code word of length 2 for one of the remaining objects, this blocks two of the remaining four words of length 3 from being used.
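A quick sketch to check the claim about the 14-string scheme (the helper function is just for illustration):

```python
def prefix_violations(codewords):
    """All pairs (u, v) where the code word u is a proper prefix of the code word v."""
    return [(u, v) for u in codewords for v in codewords
            if u != v and v.startswith(u)]

# The 2 + 4 + 8 = 14 binary strings of lengths 1, 2 and 3.
words = ['0', '1',
         '00', '01', '10', '11',
         '000', '001', '010', '011', '100', '101', '110', '111']

# The code word 0 is a prefix of 6 longer code words, so this is not a prefix code.
print(sorted(v for u, v in prefix_violations(words) if u == '0'))
# -> ['00', '000', '001', '01', '010', '011']
```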

The Kraft inequality makes this ‘cost’ intuition about prefix codes precise. You have 1 unit of money that you can spend, and the cost of any given word is related to the fraction of words of that length that it represents: \(2^{-l}\). Thus, if you have \(m\) objects that you would like to encode as binary strings, with code words of lengths \(l_{1},\ldots,l_{m}\), then you cannot spend more than 1 unit of money: \[\Sigma_{i=1}^{m} 2^{-l_{i}} \le 1\]

In our previous example, a short code word of length 1 costs \(2^{-1} = 1/2\) units of money, a ‘middling’ one of length 2 costs \(2^{-2} = 1/4\) units of money, and ‘average’ ones of length 3 each cost \(2^{-3} = 1/8\) units of money. With 8 objects to encode, we see that we could spend all our money on code words of length 3 and be done with it. On the other hand, if we buy an expensive, short word of length 1, we have only \(1/2\) units of money left with which to pay for the codes of the remaining seven objects, and so we cannot afford to buy each of them an average length 3 code word. Instead, we must purchase longer words of length at least 4, which cost only \(2^{-4} = 1/16\) each. In fact, we must purchase six code words of length 4, at a total cost of \(3/8\); together with the \(1/2\) already spent on the length-1 word, this accounts for seven of our eight objects and \(7/8\) of our money, leaving exactly \(1/8\) to buy an average (length 3) code word for the last object.

If we only had to encode six objects, we could assign each of them a word of length 3, but we would have money left over (1/4 to be precise). Instead, we can buy two short code words of length 2 (at cost 1/2), and then spend the remaining money on four words of length 3. We see that this leads to a better (shorter) code than the previous choice!
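These budget calculations are easy to verify; a short sketch:

```python
def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality: the total 'cost', sum of 2^(-l)."""
    return sum(2 ** -l for l in lengths)

# Eight objects, each with a length-3 code word: spends the whole budget.
print(kraft_sum([3] * 8))                 # 1.0

# One length-1 word, six length-4 words and one length-3 word: also exactly 1.
print(kraft_sum([1] + [4] * 6 + [3]))     # 1.0

# Six objects, all length 3, leaves 1/4 of the budget unused ...
print(kraft_sum([3] * 6))                 # 0.75
# ... whereas two length-2 words plus four length-3 words spend it all.
print(kraft_sum([2] * 2 + [3] * 4))       # 1.0
```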

These considerations justify the following:

achieving a short description length for the data is equivalent to identifying the data as belonging to a tiny, very special subset out of all a priori possible data sequences.

Of course, no matter the length \(l\) we choose, \(2^{-l}\) is between 0 and 1, and as we’ve seen, the total ‘cost’ of all of our code words is never greater than one. Thus, the function assigning to each object \(x\) the value \(2^{-L_C(x)}\) is a probability distribution! It is proper (all probabilities add up to exactly 1) whenever the code is complete (i.e. optimal w.r.t. the lengths of the code words it assigns). Moreover, given any probability distribution \(P\), we can create a prefix code \(C_{P}\) which assigns each object a code word whose length is the negative base-2 logarithm of that object's probability (rounded up): \(L_{C_P}(x) = \lceil -\log_{2} P(x)\rceil\). Thus, prefix codes and probabilities are two different ways of talking about the same thing.
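The direction from a distribution to a code can be sketched directly: the recipe above assigns each object a code word of length \(\lceil -\log_2 P(x)\rceil\). The example distribution below is made up; with these dyadic probabilities the rounding changes nothing and the lengths spend the Kraft budget exactly.

```python
import math

def code_lengths_from_distribution(P):
    """For each object x, the length ceil(-log2 P(x)) of its code word in the
    prefix code C_P built from the distribution P."""
    return {x: math.ceil(-math.log2(p)) for x, p in P.items()}

P = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
lengths = code_lengths_from_distribution(P)
print(lengths)  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}

# These lengths satisfy the Kraft inequality (with equality here), so a prefix
# code with exactly these code-word lengths exists, and 2^(-length) recovers P.
print(sum(2 ** -l for l in lengths.values()))  # 1.0
```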

Because of the close connection between codes and probabilities, this tutorial now adopts the a priori surprising position that code lengths are what really matter (not particular codes), and from here on will talk primarily about code length functions, not code functions. (Maybe it’s not so surprising - after all, the approach is called minimum description length…) Weirder still, we now allow fractional code lengths - this is only to make the math easier.

The information inequality

The connection between probability distributions and prefix codes discussed above allows us to talk about codes as being for distributions. It would be nice if the optimal code for encoding some data were related to the probability distribution according to which the data is generated. This is what the information inequality states: the expected cost, according to the true distribution \(P\), of encoding the data using a code for \(P\), is never greater than the expected cost of encoding it using a code for any other distribution \(Q\) (and is strictly smaller unless \(Q = P\)). We talk of expected costs because our data might be unrepresentative (we might end up getting 10 heads in a row by flipping a fair coin).
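Using the idealized (fractional) code lengths \(-\log_2 Q(x)\) from the previous section, the information inequality can be illustrated numerically. The two distributions below are made up for illustration; the expected code length under the true \(P\) is smallest when the code is built from \(P\) itself.

```python
import math

def expected_code_length(P_true, P_code):
    """Expected number of bits when data generated by P_true is encoded with the
    idealized code for P_code: sum over x of P_true(x) * -log2 P_code(x)."""
    return sum(p * -math.log2(P_code[x]) for x, p in P_true.items())

P = {'heads': 0.9, 'tails': 0.1}   # the 'true' distribution (made up)
Q = {'heads': 0.5, 'tails': 0.5}   # a different distribution

print(expected_code_length(P, P))  # about 0.469 bits per outcome
print(expected_code_length(P, Q))  # 1.0 bits per outcome: never smaller than the line above
```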

Miscellanea

  • this was a bit heavier than what we have read previously, so please ask any questions you might have in the comments.
  • for next time, start reading section 2.3