Grünwald 2004 (part 2)

The basic idea behind MDL is summarized below.

MDL: The basic idea

The goal of statistical inference may be cast as trying to find regularity in the data. ‘Regularity’ may be identified with ‘ability to compress’. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses H and data set D, we should try to find the hypothesis or combination of hypotheses in H that compresses D most.

Next follows an example of fitting a function (a polynomial in x) to some data. We can think of this as the results of a psycholinguistic experiment, measuring how reading time (dependent variable) depends on cognitive load (independent variable). A linear regression will find the function of the form \(f(x) = mx + b\), a straight line, that best fits the data. This will generally not fit the data very well if reading time does not depend linearly on cognitive load. At the other extreme, if we have n data points, we can find a polynomial of degree n-1, \(f(x) = a_{n}x^{n-1} + a_{n-1}x^{n-2} + \dots + a_{2}x + a_{1}\), which will fit the given data exactly. However:

Yet if one tests the inferred polynomial on a second set of data coming from the same source, it typically fits this test data very badly in the sense that there is a large distance between the polynomial and the new data points. We say that the polynomial overfits the data.

This example shows us the trade-off between simplicity (a straight line is very simple compared to a polynomial of high degree) and goodness of fit (the straight line is a poor match to the given data, while the high-degree polynomial is an exact match). Crucially, neither extreme is a good predictor of future data:

Indeed, all model selection methods that are used in practice either implicitly or explicitly choose a trade-off between goodness-of-fit and complexity of the models involved. In practice, such trade-offs lead to much better predictions of test data than one would get by adopting the ‘simplest’ (one degree) or most ‘complex’ (n − 1-degree) polynomial. MDL provides one particular means of achieving such a trade-off.
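To see the overfitting problem concretely, here is a minimal sketch (my own illustration, not from the paper): it generates a few noisy points from a made-up linear 'reading time vs. cognitive load' relationship, fits both the 'simplest' (degree-1) and the 'most complex' (degree n-1) polynomial, and compares their errors on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reading-time-vs-cognitive-load data: a noisy linear trend.
# (Illustrative numbers only; not from any real experiment.)
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 1.0, size=x_train.size)
x_test = np.array([1.5, 2.5, 3.5, 4.5])
y_test = 2.0 * x_test + 1.0 + rng.normal(0.0, 1.0, size=x_test.size)

# Fit the 'simplest' (straight line) and 'most complex' (degree n-1) polynomial.
for degree in (1, x_train.size - 1):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# The degree-4 polynomial interpolates the training points (train MSE ~ 0), but
# on the held-out points it typically does worse than the straight line.
```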

We next introduce the distinction between models and (point) hypotheses:

We use the phrase point hypothesis to refer to a single probability distribution or function. An example is the polynomial \(5x^{2}+4x+3\). A point hypothesis is also known as a ‘simple hypothesis’ in the statistical literature. We use the word model to refer to a family (set) of probability distributions or functions with the same functional form. An example is the set of all second-degree polynomials. A model is also known as a ‘composite hypothesis’ in the statistical literature. We use hypothesis as a generic term, referring to both point hypotheses and models.

The curve-fitting example just described can have different goals. It is a ‘hypothesis selection problem’ if we are trying to identify both the model (the degree of the polynomial we want to use) and the point hypothesis (the actual function used to describe the data). It is a ‘model selection problem’ if we just want to identify which class of polynomials the optimal solution is in.

Now we are presented with an explicit formulation of the MDL idea:

Crude, Two-part Version of MDL Principle (Informally Stated)

Let \(H^{(1)}\), \(H^{(2)}\), \(\ldots\) be a list of candidate models (e.g., \(H^{(k)}\) is the set of k-th degree polynomials), each containing a set of point hypotheses (e.g., individual polynomials). The best point hypothesis \(H \in H^{(1)}\cup H^{(2)} \cup \dots\) to explain the data \(D\) is the one which minimizes the sum \(L(H) + L(D|H)\), where

  • \(L(H)\) is the length, in bits, of the description of the hypothesis; and
  • \(L(D|H)\) is the length, in bits, of the description of the data when encoded with the help of the hypothesis.

The best model to explain \(D\) is the smallest model containing the selected \(H\).

The idea is that we are supposed to write down exactly the data that we saw. We can do this in any way we want. Let us for concreteness say that our data is the sequence of pairs \(D = (1,1),(5,5),(6,8),(3,2),(7,13)\). One way is to just write down the data as we have just done. However, doing this requires us to have made decisions about how to represent our data. For example, here we have used decimal notation. In MDL it is standard to formulate things in binary, and our data in binary look different: \(D_{bin} = (1,1),(101,101),(110,1000),(11,10),(111,1101)\). We want to measure how much we have to write down, and a good metric of this is length: we can just count symbols.1 In the decimal representation of our data, there are 19 punctuation symbols (commas and parentheses) and 11 digits, making a grand total of 30 symbols. In the binary representation of our data there are the same number of punctuation symbols, but now 26 digits, making for 45 symbols.

However, the longer the numbers, the more space it takes to write them down. If we have a hypothesis \(H\), say \(f(x) = 2x - 2\), we can use this hypothesis to shrink the description of our data; instead of just writing the second member of each pair, we can write how much it diverges from our hypothesis: we write instead \((x,y - f(x))\). The data given our hypothesis now look like the following: \(D | H = (1,1),(5,-3),(6,-2),(3,-2),(7,1)\). Or in binary: \((D|H)_{bin} = (1,1),(101,-11),(110,-10),(11,-10),(111,1)\). Now our decimal representation of the data given the hypothesis has 22 punctuation symbols (commas, parentheses, and negative signs) and 10 digits, making for a grand total of 32 symbols, and our binary representation of the same has 22 punctuation symbols and 20 digits, making for a grand total of 42 symbols.2 However, any reduction in the size of the data comes from using a particular point hypothesis - we have essentially used the hypothesis as a cipher for encoding the data. We need to write down this hypothesis when we present the data, so that the data can be decoded. Since we assume that we will be using polynomials, we have to specify the model, which in this case is the family of degree one polynomials, and then the point hypothesis in that model. We can specify the model by writing down a degree, in our case ‘1’, and we can specify the point hypothesis by writing down the coefficients, in this case 2 and -2. Thus, our description of our hypothesis is the three numbers \(H = 1,2,-2\) (three punctuation symbols and three digits), or in binary \(H_{bin} = 1,10,-10\) (three punctuation symbols and five digits).

We can measure the cost of encoding the data with a hypothesis by adding up the lengths of specifying the hypothesis and the data given the hypothesis, which is 38 symbols in decimal and 50 in binary. We see that the benefit of using a hypothesis to encode the data was not enough to justify its use in this case (encoding the data without a hypothesis took just 30 symbols in decimal and 45 in binary). This is mostly due to the small number of data points - once we have enough data, the cost of encoding a hypothesis is generally small relative to the cost of encoding the data.
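The whole comparison can be checked mechanically. Here is a minimal sketch (mine, not Grünwald's) that reproduces the symbol counts above by literally writing out the data, the residuals, and the hypothesis, and counting characters:

```python
def count_symbols(s: str) -> int:
    # Every digit, comma, parenthesis, and minus sign counts as one symbol.
    return len(s)

def render(items, base=10):
    """Write out a list of numbers or (x, y) pairs in decimal or binary."""
    def fmt(n):
        body = str(abs(n)) if base == 10 else bin(abs(n))[2:]
        return ("-" if n < 0 else "") + body
    if isinstance(items[0], tuple):
        return ",".join(f"({fmt(a)},{fmt(b)})" for a, b in items)
    return ",".join(fmt(n) for n in items)

D = [(1, 1), (5, 5), (6, 8), (3, 2), (7, 13)]
f = lambda x: 2 * x - 2                      # the point hypothesis H
D_given_H = [(x, y - f(x)) for x, y in D]    # write residuals instead of raw y
H_spec = [1, 2, -2]                          # degree, then the coefficients

for base in (10, 2):
    raw = count_symbols(render(D, base))
    two_part = count_symbols(render(H_spec, base)) + count_symbols(render(D_given_H, base))
    print(f"base {base}: raw data = {raw} symbols, hypothesis + residuals = {two_part} symbols")

# Prints 30 vs. 38 symbols in decimal and 45 vs. 50 in binary, matching the hand
# counts: with only five data points the hypothesis does not pay for itself.
```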

Grünwald gives a general overview of this example:

In our previous example, the candidate hypotheses were polynomials. We can describe a polynomial by describing its coefficients in a certain precision (number of bits per parameter). Thus, the higher the degree of a polynomial or the precision, the more bits we need to describe it and the more ‘complex’ it becomes. A description of the data ‘with the help of’ a hypothesis means that the better the hypothesis fits the data, the shorter the description will be. A hypothesis that fits the data well gives us a lot of information about the data. Such information can always be used to compress the data (Section 2.2). Intuitively, this is because we only have to code the errors the hypothesis makes on the data rather than the full data. In our polynomial example, the better a polynomial \(H\) fits D, the fewer bits we need to encode the discrepancies between the actual y-values \(y_{i}\) and the predicted y-values \(H(x_{i})\). We can typically find a very complex point hypothesis (large \(L(H)\)) with a very good fit (small \(L(D|H)\)). We can also typically find a very simple point hypothesis (small \(L(H)\)) with a rather bad fit (large \(L(D|H)\)). The sum of the two description lengths will be minimized at a hypothesis that is quite (but not too) ‘simple’, with a good (but not perfect) fit.

I noted (very briefly) above that one needs to choose a representation for our data and our hypotheses. Now we turn to a more detailed discussion of this choice.

Crude MDL picks the \(H\) minimizing the sum \(L(H)+L(D|H)\). To make this procedure well-defined, we need to agree on precise definitions for the codes (description methods) giving rise to lengths \(L(D|H)\) and \(L(H)\). We now discuss these codes in more detail. We will see that the definition of \(L(H)\) is problematic, indicating that we somehow need to ‘refine’ our crude MDL Principle.

Definition of \(L(D|H)\):

Consider a two-part code as described above, and assume for the time being that all \(H\) under consideration define probability distributions. If \(H\) is a polynomial, we can turn it into a distribution by making the additional assumption that the \(Y\)-values are given by \(Y = H(X) + Z\), where \(Z\) is a normally distributed noise term. For each \(H\) we need to define a code with length \(L(\cdot | H)\) such that \(L(D|H)\) can be interpreted as ‘the codelength of \(D\) when encoded with the help of \(H\)'. It turns out that for probabilistic hypotheses, there is only one reasonable choice for this code. It is the so-called Shannon-Fano code, satisfying, for all data sequences \(D\), \(L(D|H) = -\log P(D|H)\), where \(P(D|H)\) is the probability mass or density of \(D\) according to \(H\) – such a code always exists, Section 2.2.

Where did these probability distributions come from?! The result that Grünwald wants can be stated best if our theories are probabilistic. As it happens, it is straightforward to turn a non-probabilistic theory into a probabilistic one - just add noise. More interestingly, there is a single best way to encode the data given the hypothesis (the Shannon-Fano code), which ties the codelength of the data given the hypothesis directly to its probability under the hypothesis: \(L(D|H) = -\log P(D|H)\).
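To make the Shannon-Fano relation concrete, here is a small sketch (my own, not from the paper) that turns the running hypothesis \(f(x) = 2x - 2\) into a probabilistic one by assuming \(Y = f(X) + Z\) with Gaussian noise \(Z\) (the noise level \(\sigma = 2\) is an arbitrary choice here), and computes \(L(D|H) = -\log_2 P(D|H)\) in bits:

```python
import math

D = [(1, 1), (5, 5), (6, 8), (3, 2), (7, 13)]
f = lambda x: 2 * x - 2      # the point hypothesis H
sigma = 2.0                  # assumed noise standard deviation (illustrative only)

def gaussian_density(residual, sigma):
    return math.exp(-residual ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Shannon-Fano codelength: L(D|H) = -log2 P(D|H), treating the y-values as
# independent given the x-values.  (A density rather than a discretized
# probability mass is used, so this is an idealized codelength.)
L_D_given_H = sum(-math.log2(gaussian_density(y - f(x), sigma)) for x, y in D)
print(f"L(D|H) = {L_D_given_H:.2f} bits")
```

A hypothesis with smaller residuals assigns the data higher probability and hence a shorter codelength, which is exactly what the two-part code exploits.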

The real problem is how to decide how best to encode hypotheses! The problem is that there are many arbitrary decisions we can make that will result in the same hypotheses being of different lengths. According to one arbitrary decision, maybe \(H_{1}\) is the optimal hypothesis; under another arbitrary decision, suddenly \(H_{2}\) is better.

Definition of \(L(H)\): A Problem for Crude MDL

It is more problematic to find a good code for hypotheses \(H\). Some authors have simply used ‘intuitively reasonable’ codes in the past, but this is not satisfactory: since the description length \(L(H)\) of any fixed point hypothesis \(H\) can be very large under one code, but quite short under another, our procedure is in danger of becoming arbitrary. Instead, we need some additional principle for designing a code for H.

The solution to this problem is to adopt a sort of ‘Bayesian’ perspective, which is to use all of the point hypotheses in a model! The idea is described below, but we will have to wait before seeing it made precise.

Refined MDL

In refined MDL, we associate a code for encoding \(D\) not with a single \(H \in \mathcal{H}\), but with the full model \(\mathcal{H}\). Thus, given model \(\mathcal{H}\), we encode data not in two parts but we design a single one-part code with lengths \(\overline{L}(D|\mathcal{H})\). This code is designed such that whenever there is a member of (parameter in) \(H \in \mathcal{H}\) that fits the data well, in the sense that \(L(D | H)\) is small, then the codelength \(\overline{L}(D|\mathcal{H})\) will also be small. Codes with this property are called universal codes in the information-theoretic literature [Barron, Rissanen, and Yu 1998]. Among all such universal codes, we pick the one that is minimax optimal in a sense made precise in Section 2.5. For example, the set \(\mathcal{H}^{(3)}\) of third-degree polynomials is associated with a code with lengths \(\overline{L}(\cdot | \mathcal{H}^{(3)})\) such that, the better the data D are fit by the best-fitting third-degree polynomial, the shorter the codelength \(\overline{L}(D | \mathcal{H})\). \(\overline{L}(D|\mathcal{H})\) is called the stochastic complexity of the data given the model.

Here finding the right class of models becomes very important. Also important is the distinction between models and parameters…

Parametric Complexity

The second fundamental concept of refined MDL is the parametric complexity of a parametric model \(\mathcal{H}\) which we denote by \(COMP(\mathcal{H})\). This is a measure of the ‘richness’ of model \(\mathcal{H}\), indicating its ability to fit random data. This complexity is related to the degrees-of-freedom in \(\mathcal{H}\), but also to the geometrical structure of \(\mathcal{H}\); see Example 1.4.

The parametric complexity of a model (i.e. set of hypotheses) measures how flexible that model is - if your model could account for lots of different possible data, it is more flexible than one which could only account for small variations in the actual data. The stochastic complexity of the data given the model is defined as the sum of the length of the description of \(D\) under the best hypothesis in \(\mathcal{H}\) and the parametric complexity of \(\mathcal{H}\).

Model selection between two parametric models (such as the models of first and second degree polynomials) now proceeds by selecting the model such that the stochastic complexity of the given data \(D\) is smallest. Although we used a one-part code to encode data, refined MDL model selection still involves a trade-off between two terms: a goodness-of-fit term [the length of the description of the data under the best hypothesis in the model] and a complexity term [the parametric complexity of the model]. However, because we do not explicitly encode hypotheses \(H\) any more, there is no arbitrariness any more.
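Putting the two quoted pieces together, refined MDL model selection can be stated schematically as follows (my paraphrase of the above, writing \(\hat{H}\) for the best-fitting point hypothesis in \(\mathcal{H}\)):

\[
\overline{L}(D \mid \mathcal{H}) \;=\; \underbrace{L(D \mid \hat{H})}_{\text{goodness of fit}} + \underbrace{\mathrm{COMP}(\mathcal{H})}_{\text{parametric complexity}},
\qquad \text{and we select the } \mathcal{H}^{(k)} \text{ minimizing } \overline{L}(D \mid \mathcal{H}^{(k)}).
\]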

So we can get away without explicitly describing hypotheses by focusing instead on models. We can think about this in a number of different ways:

The resulting procedure can be interpreted in several different ways, some of which provide us with rationales for MDL beyond the pure coding interpretation (Sections 2.6.1–2.6.4):

Counting/differential geometric interpretation
The parametric complexity of a model is the logarithm of the number of essentially different, distinguishable point hypotheses within the model.
Two-part code interpretation
For large samples, the stochastic complexity can be interpreted as a two-part codelength of the data after all, where hypotheses \(H\) are encoded with a special code that works by first discretizing the model space \(\mathcal{H}\) into a set of ‘maximally distinguishable hypotheses’, and then assigning equal codelength to each of these.
Bayesian interpretation
In many cases, refined MDL model selection coincides with Bayes factor model selection based on a non-informative prior such as Jeffreys’ prior [Bernardo and Smith 1994].
Prequential interpretation
Refined MDL model selection can be interpreted as selecting the model with the best predictive performance when sequentially predicting unseen test data, in the sense described in Section 2.6.4. This makes it an instance of Dawid’s [1984] prequential model validation and also relates it to cross-validation methods.

We now turn to an adumbration of basic philosophical tenets of MDL.

Regularity as Compression
According to Rissanen, the goal of inductive inference should be to ‘squeeze out as much regularity as possible’ from the given data. The main task for statistical inference is to distill the meaningful information present in the data, i.e. to separate structure (interpreted as the regularity, the ‘meaningful information’) from noise (interpreted as the ‘accidental information’). For the three sequences of Example 1.2, this would amount to the following: the first sequence would be considered as entirely regular and ‘noiseless’. The second sequence would be considered as entirely random - all information in the sequence is accidental, there is no structure present. In the third sequence, the structural part would (roughly) be the pattern that 4 times as many 0s than 1s occur; given this regularity, the description of exactly which of all sequences with four times as many 0s than 1s occurs, is the accidental information.

This might take some getting used to, but it seems sort of intuitive. It’s at least very close to linguistic practice, where people (especially in Distributed Morphology) find ways of describing any conceivable regularity in the data.

Models as Languages
Rissanen interprets models (sets of hypotheses) as nothing more than languages for describing useful properties of the data – a model \(\mathcal{H}\) is identified with its corresponding universal code \(\overline{L}(\cdot | \mathcal{H})\). Different individual hypotheses within the models express different regularities in the data, and may simply be regarded as statistics, that is, summaries of certain regularities in the data. These regularities are present and meaningful independently of whether some \(H_{\ast}\in \mathcal{H}\) is the ‘true state of nature’ or not. Suppose that the model \(\mathcal{H}\) under consideration is probabilistic. In traditional theories, one typically assumes that some \(P_{\ast}\in \mathcal{H}\) generates the data, and then ‘noise’ is defined as a random quantity relative to this \(P_{\ast}\). In the MDL view ‘noise’ is defined relative to the model \(\mathcal{H}\) as the residual number of bits needed to encode the data once the model \(\mathcal{H}\) is given. Thus, noise is not a random variable: it is a function only of the chosen model and the actually observed data. Indeed, there is no place for a ‘true distribution’ or a ‘true state of nature’ in this view – there are only models and data. To bring out the difference to the ordinary statistical viewpoint, consider the phrase ‘these experimental data are quite noisy’. According to a traditional interpretation, such a statement means that the data were generated by a distribution with high variance. According to the MDL philosophy, such a phrase means only that the data are not compressible with the currently hypothesized model – as a matter of principle, it can never be ruled out that there exists a different model under which the data are very compressible (not noisy) after all!

This is less intuitive, perhaps because as linguists we are convinced that there is in fact a true grammar out there for us to discover. Rissanen’s philosophical position here is an anti-realist one: we are trying to come up with a model which describes/predicts data. Is it true? What does that mean, beyond ‘it describes and predicts data well’?

We Have Only the Data
Many (but not all) other methods of inductive inference are based on the idea that there exists some ‘true state of nature’, typically a distribution assumed to lie in some model \(\mathcal{H}\). The methods are then designed as a means to identify or approximate this state of nature based on as little data as possible. According to Rissanen, such methods are fundamentally flawed. The main reason is that the methods are designed under the assumption that the true state of nature is in the assumed model \(\mathcal{H}\), which is often not the case. Therefore, such methods only admit a clear interpretation under assumptions that are typically violated in practice. Many cherished statistical methods are designed in this way; we mention hypothesis testing, minimum-variance unbiased estimation, several non-parametric methods, and even some forms of Bayesian inference – see Example 2.22. In contrast, MDL has a clear interpretation which depends only on the data, and not on the assumption of any underlying ‘state of nature’.

Because MDL is defined in the way it is, it does not presuppose that the truth is out there (i.e. that the true hypothesis is in our model). This is reflected in redefining learning as compressing the data: if fiction provides a better account of the data than the truth, we will arrive at this fiction.

Here is an example of useful fictions:

Example 1.5 [Models that are Wrong, yet Useful] Even though the models under consideration are often wrong, they can nevertheless be very useful. Examples are the successful ‘Naive Bayes’ model for spam filtering, Hidden Markov Models for speech recognition (is speech a stationary ergodic process? probably not), and the use of linear models in econometrics and psychology. Since these models are evidently wrong, it seems strange to base inferences on them using methods that are designed under the assumption that they contain the true distribution. To be fair, we should add that domains such as spam filtering and speech recognition are not what the fathers of modern statistics had in mind when they designed their procedures – they were usually thinking about much simpler domains, where the assumption that some distribution \(P_{\ast}\in \mathcal{H}\) is ‘true’ may not be so unreasonable.

MDL and Consistency
Let \(\mathcal{H}\) be a probabilistic model, such that each \(P \in\mathcal{H}\) is a probability distribution. Roughly, a statistical procedure is called consistent relative to \(\mathcal{H}\) if, for all \(P_{\ast}\in \mathcal{H}\), the following holds: suppose data are distributed according to \(P_{\ast}\). Then given enough data, the learning method will learn a good approximation of \(P_{\ast}\) with high probability. Many traditional statistical methods have been designed with consistency in mind (Section 2.3).

In other words, a learning procedure is consistent just in case the learner will learn a good approximation of the true grammar if given enough representative data.

But wait, MDL doesn’t assume that there is a true grammar…

The fact that in MDL, we do not assume a true distribution may suggest that we do not care about statistical consistency. But this is not the case: we would still like our statistical method to be such that in the idealized case where one of the distributions in one of the models under consideration actually generates the data, our method is able to identify this distribution, given enough data. If even in the idealized special case where a ‘truth’ exists within our models, the method fails to learn it, then we certainly cannot trust it to do something reasonable in the more general case, where there may not be a ‘true distribution’ underlying the data at all. So: consistency is important in the MDL philosophy, but it is used as a sanity check (for a method that has been developed without making distributional assumptions) rather than as a design principle.

Ok, well, that’s reassuring.

Next we turn to the idea that MDL chooses the simpler of two models that fit the data equally well.

When two models fit the data equally well, MDL will choose the one that is the ‘simplest’ in the sense that it allows for a shorter description of the data. As such, it implements a precise form of Occam’s Razor – even though as more and more data becomes available, the model selected by MDL may become more and more ‘complex’! Occam’s Razor is sometimes criticized for being either (1) arbitrary or (2) false [Webb 1996; Domingos 1999]. Do these criticisms apply to MDL as well?

The first criticism is based on the idea that simplicity is in the eye of the beholder, and is thus arbitrary.

‘1. Occam’s Razor (and MDL) is arbitrary’
Because ‘description length’ is a syntactic notion it may seem that MDL selects an arbitrary model: different codes would have led to different description lengths, and therefore, to different models. By changing the encoding method, we can make ‘complex’ things ‘simple’ and vice versa. This overlooks the fact that we are not allowed to use just any code we like! ‘Refined’ MDL tells us to use a specific code, independent of any specific parameterization of the model, leading to a notion of complexity that can also be interpreted without any reference to ‘description lengths’ (see also Section 2.10.1).

The short answer to this criticism is that, while crude MDL might be arbitrary in this sense, refined MDL is not, because there is a principled way of determining the notion of simplicity.

The next criticism is that the truth might not actually be simple.

‘2. Occam’s Razor is false’
It is often claimed that Occam’s razor is false - we often try to model real-world situations that are arbitrarily complex, so why should we favor simple models? In the words of G. Webb: ‘What good are simple models of a complex world?’ The short answer is: even if the true data generating machinery is very complex, it may be a good strategy to prefer simple models for small sample sizes. Thus, MDL (and the corresponding form of Occam’s razor) is a strategy for inferring models from data (“choose simple models at small sample sizes”), not a statement about how the world works (“simple models are more likely to be true”) – indeed, a strategy cannot be true or false, it is ‘clever’ or ‘stupid’. And the strategy of preferring simpler models is clever even if the data generating process is highly complex.

The response to this criticism is that Occam’s razor is a methodological imperative. Work in the philosophy of inductive inference has argued for this position as well.

(I have elided an example with useful discussion.) The following will be made precise later, but it is a justification for this methodological perspective on Occam’s razor.

The Inherent Difference between Under- and Overfitting

If we choose an overly simple model for our data, then the best-fitting point hypothesis within the model is likely to be almost the best predictor, within the simple model, of future data coming from the same source. If we overfit (choose a very complex model) and there is noise in our data, then, even if the complex model contains the ‘true’ point hypothesis, the best-fitting point hypothesis within the model is likely to lead to very bad predictions of future data coming from the same source. This statement is very imprecise and is meant more to convey the general idea than to be completely true. As will become clear in Section 2.10.1, it becomes provably true if we use MDL’s measure of model complexity; we measure prediction quality by logarithmic loss; and we assume that one of the distributions in \(\mathcal{H}\) actually generates the data.

For next time

Read section 2.2 Information Theory I


  1. Technically, we have to encode commas and parentheses into binary numbers as well, but I will ignore that here. ↩︎

  2. Using a hypothesis to encode the data didn’t actually help for the decimal representation. ↩︎