Grünwald 2004 (part 5)

The philosophy of MDL provides a rigorous method to choose between competing theories of some data: choose the theory for which the combined description - the theory itself plus the data described by means of it - is shortest. ‘Crude’ MDL uses a two-part coding scheme to encode data and hypothesis:

  1. the hypothesis is encoded
  2. the data is encoded by means of the hypothesis

In order to carry this out, we need two families of encoding schemes:

  1. a scheme \(C_{1}\) that tells us how to encode hypotheses
  2. a scheme \(C_{2,H}\), for each hypothesis \(H\), that tells us how to encode data given hypothesis \(H\)

Then the best hypothesis given some data \(D\) is one which minimizes the sum \(C_{1}(H) + C_{2,H}(D)\), where we write \(C(x)\) for the length in bits of the codeword that \(C\) assigns to \(x\).
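To make the selection rule concrete, here is a minimal sketch in Python. The functions `hypothesis_bits` and `data_bits_given` are hypothetical stand-ins for the code lengths \(C_{1}(H)\) and \(C_{2,H}(D)\); nothing here is specific to Grünwald’s text beyond the minimization itself.

```python
def two_part_mdl(hypotheses, data, hypothesis_bits, data_bits_given):
    """Return the hypothesis H minimizing hypothesis_bits(H) + data_bits_given(H, data)."""
    return min(
        hypotheses,
        key=lambda H: hypothesis_bits(H) + data_bits_given(H, data),
    )
```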

Recall our discussion of Markov chains: a Markov chain of order k can be viewed as a collection of \(2^{k}\) probability distributions over the alphabet \(\{0,1\}\). Each distribution assigns a probability to seeing a particular letter \(i\) after a particular length k sequence of previous letters. A more ‘intuitive’ way of visualizing a Markov chain of order k is as a finite state machine in which each state represents the k previously seen letters and carries one of these probability distributions.
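To make the ‘finite state machine’ picture concrete, one convenient representation (my own bookkeeping choice, not anything from the text) is a dictionary mapping each of the \(2^{k}\) length-k contexts to its parameter \(\theta\), the probability of a 0 in that state:

```python
from itertools import product

def uniform_order_k_chain(k):
    """One state per length-k context; each state stores theta = P(next letter is '0')."""
    return {''.join(bits): 0.5 for bits in product('01', repeat=k)}

chain = uniform_order_k_chain(2)
# chain == {'00': 0.5, '01': 0.5, '10': 0.5, '11': 0.5}
```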

How might we provide a general scheme to encode Markov chains of arbitrary order? One natural idea is to first specify the order k of a given Markov chain, and then to specify each of the \(2^{k}\) probability distributions that constitute it. As each probability distribution is a Bernoulli distribution (a possibly biased coin flip between the letters 0 and 1), a distribution requires exactly one value \(\theta\in [0,1]\) - the probability of a 0 - to specify it. So if we are given a number k, and then \(2^{k}\) subsequent parameters \(\theta_{1},\dots,\theta_{2^{k}}\) representing the probability distributions in some canonical order - say the lexicographic (‘alphabetical’) order of the length k sequence of previous letters that determines each one - we have unambiguously specified a Markov chain.
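Continuing the dictionary representation above, the ‘first k, then the parameters in canonical order’ scheme amounts to the following flattening and its inverse (again my own sketch, not a literal bit-level code):

```python
from itertools import product

def chain_to_parameter_list(chain):
    """Flatten a {context: theta} chain into (k, [theta_1, ..., theta_{2^k}])."""
    contexts = sorted(chain)              # lexicographic order of the 2^k contexts
    k = len(contexts[0])
    return k, [chain[c] for c in contexts]

def parameter_list_to_chain(k, thetas):
    """Invert the flattening, pairing thetas with contexts in the same canonical order."""
    contexts = [''.join(bits) for bits in product('01', repeat=k)]  # already in sorted order
    return dict(zip(contexts, thetas))
```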

This is how to encode an arbitrary Markov chain. Note that, even when we fix k, the encoding of the Markov chain might be arbitrarily long - any (or all) of the parameters \(\theta_{i}\) could be arbitrarily complex: the digits of \(\theta_{i}\) after the decimal point could form a perfectly random (i.e. incompressible) sequence.

However, given the data \(x_{1},\dots,x_{n}\) that we want to encode, we can put an upper bound on the length of the encoding of the order k Markov chain which maximizes the likelihood of this data. Recall that the maximum likelihood order k Markov chain for the data sets each parameter \(\theta_{b|b_{1}\dots b_{k}}\) ‘empirically’, as the ratio of the number of occurrences of the string \(b_{1}\dots b_{k}b\) to the number of occurrences of its prefix \(b_{1}\dots b_{k}\). Thus we do not need to encode \(\theta_{b|b_{1}\dots b_{k}}\) itself, but can get away with just encoding the number of times we have seen \(b_{1}\dots b_{k}0\) and \(b_{1}\dots b_{k}1\). From these counts, each of which can be no greater than the sum of the lengths of the words in the data, we can recover \(\theta_{b|b_{1}\dots b_{k}}\). Moreover, the counts are related to each other - the text says that “it is not hard to see” that we can recover the \(n_{0|b_{1}\dots b_{k}}\) counts from the \(n_{1|b_{1}\dots b_{k}}\) counts. I don’t see exactly how to do this myself at the moment. At any rate, assuming that the text is right, given the data we need to encode \(2^{k}\) integers, each of which is no greater than the sum of the lengths of the words in the data. This allows us to compute an upper bound on the length of the encoding of the maximum likelihood Markov chain for the data. It’s sort of interesting that the data plays a role in bounding the size of the best hypothesis!
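Here is a sketch of both halves of that observation, with my own naming: counting gives the maximum-likelihood parameters directly, and since each count is at most the total data length \(N\), the \(2^{k}\) counts fit in roughly \(2^{k}\log_{2}(N+1)\) bits. This crude bound ignores the cost of encoding k itself and does not exploit the relation between counts that the text mentions.

```python
import math
from collections import Counter

def ml_counts(words, k):
    """For each length-k context, count how often it is followed by '0' and by '1'."""
    counts = Counter()
    for word in words:
        for i in range(k, len(word)):
            context, letter = word[i - k:i], word[i]
            counts[(context, letter)] += 1
    return counts

def ml_theta(counts, context):
    """Empirical probability of a '0' after `context` (the maximum-likelihood theta)."""
    n0, n1 = counts[(context, '0')], counts[(context, '1')]
    return n0 / (n0 + n1) if n0 + n1 > 0 else 0.5   # arbitrary choice for unseen contexts

def count_encoding_bound_bits(words, k):
    """Crude bound: 2^k counts, each at most N = total data length, so <= 2^k * log2(N + 1) bits."""
    N = sum(len(w) for w in words)
    return 2 ** k * math.log2(N + 1)
```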

Because of the deep connection between encoding and probability we discussed previously, we can speak in a very non-constructive¹ way about the length of the encoding of the data given the hypothesis, without speaking about the identity of the code itself. Still, it is certainly more insightful, at least initially, to see how the codes actually work. Given a sequence \(00010\), which might be our third data point, we need to describe it exactly. We can do this by simply listing the letters it contains. This, however, makes no use of our Markov-chain-shaped hypothesis! Unfortunately, since we have a binary alphabet we already have a prefix code, and since we cannot use fewer than one symbol per letter, we cannot in any simple way take advantage of the underlying Markov chain to compress the word. Ideally, we would use \(-\log_{2}\theta_{1}\) symbols to encode a 1, and \(-\log_{2}(1 - \theta_{1})\) symbols to encode a 0, where \(\theta_{1}\) is the probability the current state assigns to seeing a 1. So if our Markov chain (in state \(i\)) assigned a \(\theta_{1} = .8\) chance of seeing a 1, we would use .32 symbols to encode this 1, and 2.32 symbols to encode a 0. But of course, it’s hard to use .32 symbols!
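To see where numbers like .32 and 2.32 come from, here is the ‘ideal’ (and, for now, purely notional) code length of a word under a chain in the dictionary representation sketched earlier. Charging the first k letters one bit each is my own simplification; the fractional answer is exactly the problem flagged at the end of the paragraph above.

```python
import math

def ideal_code_length_bits(chain, word, k):
    """Sum of -log2 P(letter | previous k letters); first k letters charged 1 bit each."""
    bits = float(min(k, len(word)))        # simplifying choice for letters with no full context
    for i in range(k, len(word)):
        theta = chain[word[i - k:i]]       # this state's probability of a '0'
        p = theta if word[i] == '0' else 1 - theta
        bits += -math.log2(p)               # assumes 0 < theta < 1
    return bits
```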


  1. By this I do not mean ‘unhelpful’, but rather ‘without telling you how to build what it is you’re describing.’