Next reading: Grünwald 2004

Having finished reading Chomsky 1992, we will now turn to the minimum description length (MDL) principle. MDL offers a conceptually simple way to view inductive inference (hypothesis formation) in any domain:

Choose the hypothesis that best describes the data

Seems like a no-brainer, right? Indeed, I think pretty much everyone says/thinks something that could be phrased along these lines. The meat of the MDL principle is that ‘best describing the data’ is defined in a particular way:

A hypothesis describes data by compressing it.

The hypothesis that describes the data best is the one that compresses it the most.

It is standard in the experimental behavioural sciences (like psychology) to speak of hypotheses ‘fitting’ the data. One will typically see a scatterplot of data points (perhaps averaged over participants, with outliers removed) with a curve running through it. This curve represents a hypothesis (typically, the one with the best fit) about the data. Fit is measured by comparing how close the curve is to each actual data point: the closer the curve is to a point, the closer the hypothesis came to getting that prediction exactly right.

Under the MDL perspective, our goal is not to fit a curve to the data, but rather to make our description of the data as short as possible. Here is what the example above looks like from this perspective. The data is a collection of points in (say) 2-dimensional space, so a finite set of pairs of real numbers, corresponding to, for example, the voltage (\(x\)) of a shock and the volume (\(y\)) of the resulting scream. The curve has a simple formula, say \(f(x) = 7x^{2} + 2x + 14\). Writing this formula down takes something like 9 symbols (the number of symbols on the right-hand side of the equation). Now, instead of recording, for each tested voltage \(x_{i}\), the volume \(y_{i}\) of the elicited scream, we can record the difference \(f(x_{i}) - y_{i}\). Since smaller numbers typically have fewer digits than bigger ones, the closer \(f(x_{i})\) is to \(y_{i}\), the shorter the difference \(f(x_{i}) - y_{i}\) is to write down than the original number \(y_{i}\). Thus, if the values \(f(x_{i})\) are on the whole close to the \(y_{i}\), recording the formula for \(f\) once plus the (short) differences compresses the description of our data. What the psychologist describes as goodness of fit, the MDL-er describes as amount of compression.
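To make the arithmetic concrete, here is a minimal Python sketch (not from the reading). The data points are made up, and counting written characters is a crude stand-in for a real code length; only the formula \(f\) comes from the example above.

```python
# A toy version of the shock/scream example. The data points and the
# character-counting cost model are invented for illustration.

def chars(n: int) -> int:
    """Characters needed to write the integer n down (sign included)."""
    return len(str(n))

def f(x: int) -> int:
    """The hypothesis from the text: f(x) = 7x^2 + 2x + 14."""
    return 7 * x ** 2 + 2 * x + 14

# Hypothetical (voltage, volume) pairs: roughly f(x) plus a little noise.
data = [(10, 736), (11, 881), (12, 1049), (13, 1222), (14, 1410),
        (15, 1623), (16, 1838), (17, 2069), (18, 2320), (19, 2577)]

formula_cost = 9  # symbols on the right-hand side of the equation

# Cost of the raw description: write every volume y_i in full.
raw_cost = sum(chars(y) for _, y in data)

# Cost with the hypothesis: write f down once, then only the differences.
residual_cost = sum(chars(f(x) - y) for x, y in data)

print("raw data:            ", raw_cost)                      # 38 characters
print("formula + residuals: ", formula_cost + residual_cost)  # 9 + 14 = 23

# The psychologist's goodness of fit uses the very same differences:
print("sum of squared residuals:", sum((f(x) - y) ** 2 for x, y in data))
```

On this toy data, the formula plus the residuals takes fewer characters than the raw volumes, and the residuals that determine the compression are exactly the ones that determine the fit.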

The underlying philosophy of MDL is that we are trying to make sense of the world by identifying regularities (or patterns) in the data. If the data contains patterns, these can be used to give a shorter description of the data. The more and better patterns we discover, the shorter our description of the data becomes.
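One quick way to see this (a sketch of the idea, not how MDL is formalized) is to hand data to a general-purpose compressor and treat the compressed size as a rough proxy for description length. Data with a simple regularity shrinks dramatically; patternless data does not:

```python
# Compressed size as a crude proxy for description length.
import random
import zlib

random.seed(0)  # deterministic "patternless" bytes for reproducibility

patterned = b"01" * 5_000                                # a trivial regularity
patternless = bytes(random.getrandbits(8) for _ in range(10_000))

for name, blob in [("patterned", patterned), ("patternless", patternless)]:
    print(f"{name}: {len(blob)} bytes -> {len(zlib.compress(blob))} compressed")
```

The patterned string compresses to a tiny fraction of its length, while the random bytes come out no smaller than they went in: there is no regularity for the compressor to exploit.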

The hunt for regularities in the world becomes the desire to give a concise description of what we have seen. It is a remarkable empirical fact that short hypotheses that compress the data well also predict future data well!

For next time, please read sections 1.1 and 1.2 of Grünwald’s MDL tutorial!