Grünwald 2004 (part 1)

I will walk through the text here, commenting on points that I think deserve expansion. Our judgements in these regards will surely differ!

How does one decide among competing explanations of data given limited observations? This is the problem of model selection. It stands out as one of the most important problems of inductive and statistical inference. The Minimum Description Length (MDL) Principle is a relatively recent method for inductive inference that provides a generic solution to the model selection problem.

Model selection is often contrasted with parameter setting, although the distinction between the two is not sharp. We can think of parameter setting in the linguistic (P&P) sense, where UG provides us with a basic grammatical architecture, and the learner’s job is to set the values of the grammatical parameters (pro-drop? Head-initial? Bounding Nodes? …) In connectionist models, parameter setting might involve setting the activation thresholds for individual neurons. With parameter setting, much of the work is already done; you need ‘only’ to fill in a few gaps. Model selection is the problem of deciding what the basic grammatical architecture should be: should we adopt minimalism or tree-adjoining grammar? And, if minimalism, which concept of phases should we adopt, and will we allow late merger, or not? In a connectionist model, the model selection problem might refer to the initial topology of the network.

MDL is based on the following insight: any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than the number of symbols needed to describe the data literally. The more regularities there are, the more the data can be compressed. Equating ‘learning’ with ‘finding regularity’, we can therefore say that the more we are able to compress the data, the more we have learned about the data. Formalizing this idea leads to a general theory of inductive inference with several attractive properties:

This is a surprising claim! That learning and data compression should have anything in common is not immediately obvious. This paper will (hopefully) make this plausible.

Occam’s razor
MDL chooses a model that trades off goodness-of-fit on the observed data against ‘complexity’ or ‘richness’ of the model. As such, MDL embodies a form of Occam’s Razor, a principle that is both intuitively appealing and informally applied throughout all the sciences.

Bigger models account for the data you’ve seen better - in the extreme case, your model can just be an exhaustive list of the data you’ve seen; what could be a more precise account than that! However, the bigger the model is, the more it will account for the accidental patterns in the data (it will describe everything). A small model will not be able to account for all of the patterns in the data, because it is, well, too small. The ‘sweet spot’ is a model big enough to account for the data, but small enough to avoid accounting for the accidental patterns.

No overfitting, automatically
MDL procedures automatically and inherently protect against overfitting and can be used to estimate both the parameters and the structure (e.g., number of parameters) of a model. In contrast, to avoid overfitting when estimating the structure of a model, traditional methods such as maximum likelihood must be modified and extended with additional, typically ad hoc principles.

Overfitting refers to the situation I described above: your theory is too tailored to the data, and captures accidental patterns. Thus, it predicts these accidental patterns should extend to new data.

Bayesian interpretation
MDL is closely related to Bayesian inference, but avoids some of the interpretation difficulties of the Bayesian approach, especially in the realistic case when it is known a priori to the modeler that none of the models under consideration is true. In fact:

Bayesianism is attractive because probability is the logic of uncertain reasoning.

No need for ‘underlying truth’
In contrast to other statistical methods, MDL procedures have a clear interpretation independent of whether or not there exists some underlying ‘true’ model.

Is this a selling point for us? I don’t really think so, as we linguists typically assume that there actually is a mental grammar, and that we are trying to figure out what it is.

Predictive interpretation
Because data compression is formally equivalent to a form of probabilistic prediction, MDL methods can be interpreted as searching for a model with good predictive performance on unseen data.

This is a justification for the initial claim that it is reasonable to view compression as learning; compression is equivalent to prediction…

One thing that emerges from these claims is the following: MDL is a general-purpose method for learning that we can apply in any domain, in a uniform way.

We are interested in developing a method for learning the laws and regularities in data. The following example will illustrate what we mean by this and give a first idea of how it can be related to descriptions of data.

Regularity
… Consider the following three sequences. We assume that each sequence is 10000 bits long, and we just list the beginning and the end of each sequence.
  • 00010001000100010001 … 0001000100010001000100010001
  • 01110100110100100110 … 1010111010111011000101100010
  • 00011000001010100000 … 0010001000010000001000110000

The first of these three sequences is a 2500-fold repetition of 0001. Intuitively, the sequence looks regular; there seems to be a simple ‘law’ underlying it; it might make sense to conjecture that future data will also be subject to this law, and to predict that future data will behave according to this law. The second sequence has been generated by tosses of a fair coin. It is intuitively speaking as ‘random as possible’, and in this sense there is no regularity underlying it. Indeed, we cannot seem to find such a regularity either when we look at the data. The third sequence contains approximately four times as many 0s as 1s. It looks less regular, more random than the first; but it looks less random than the second. There is still some discernible regularity in these data, but of a statistical rather than of a deterministic kind. Again, noticing that such a regularity is there and predicting that future data will behave according to the same regularity seems sensible.
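
As a rough, informal check of this intuition, we can feed the three kinds of sequences to an off-the-shelf compressor. This is only a sketch: gzip here stands in for ‘shortest program’, and the two random sequences are generated with awk’s pseudo-random generator rather than real coin tosses.

```shell
# Informal sketch: gzip as a stand-in for 'shortest description'.
# Sequence (1): 2500 repetitions of "0001" -- highly regular.
for i in $(seq 1 2500); do printf "0001"; done > seq1.txt
# Sequence (2): 10000 pseudo-random 'fair coin' flips.
awk 'BEGIN { srand(); for (i = 0; i < 10000; i++) printf "%d", (rand() < 0.5) }' > seq2.txt
# Sequence (3): biased coin, roughly four 0s for every 1.
awk 'BEGIN { srand(); for (i = 0; i < 10000; i++) printf "%d", (rand() < 0.2) }' > seq3.txt
gzip -kf seq1.txt seq2.txt seq3.txt
wc -c seq1.txt.gz seq2.txt.gz seq3.txt.gz
```

Note that gzip compresses the ASCII encoding (8 bits per ‘0’/‘1’ character), so even the ‘incompressible’ sequence (2) shrinks by roughly a factor of 8; the point is the relative sizes, with sequence (1) compressing vastly more than the other two and sequence (3) typically landing in between.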

…and Compression
We claimed that any regularity detected in the data can be used to compress the data, i.e. to describe it in a short manner. Descriptions are always relative to some description method which maps descriptions D′ in a unique manner to data sets D. A particularly versatile description method is a general-purpose computer language like C or Pascal. A description of D is then any computer program that prints D and then halts. Let us see whether our claim works for the three sequences above. Using a shell language here in place of Grünwald’s Pascal-like example, we can write a program
for i in {1..2500} ; do
    printf "0001"
done

which prints sequence (1) but is clearly a lot shorter. Thus, sequence (1) is indeed highly compressible. On the other hand, we show in Section 2.2 that if one generates a sequence like (2) by tosses of a fair coin, then with extremely high probability, the shortest program that prints (2) and then halts will look something like this:

printf "011101001101000010101010........1010111010111011000101100010"

This program’s size is about equal to the length of the sequence. Clearly, it does nothing more than repeat the sequence. The third sequence lies in between the first two: generalizing n = 10000 to arbitrary length n, we show in Section 2.2 that the first sequence can be compressed to O(log n) bits; with overwhelming probability, the second sequence cannot be compressed at all; and the third sequence can be compressed to some length αn, with 0 < α < 1.
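
The O(log n) claim for the first sequence can be seen directly in the looping program above: only the loop bound depends on n, and writing the bound takes about log10(n/4) characters. A quick check of that growth rate:

```shell
# The program printing n/4 repetitions of "0001" grows only like log n:
# only the loop bound changes, and it has about log10(n/4) digits.
# Growing n from 10^4 to 10^8 adds just 4 characters to the program.
for n in 10000 100000000; do
  prog="for i in \$(seq 1 $((n / 4))); do printf \"0001\"; done"
  echo "n=$n  program length: ${#prog} characters"
done
```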

Observe that the goal in this example is wholly and exclusively to compress a particular sequence. Each sequence represents a different compression task (i.e., the goal is not to compress all three at once, but each individually). No attention is paid to how the sequence was generated, nor to how the sequence might continue. The situation here is analogous to having elicited sentences from a native speaker, and then going home and compressing them without a thought to next week’s elicitation session.

Example 1 [compressing various regular sequences]
The regularities underlying sequences (1) and (3) were of a very particular kind. To illustrate that any type of regularity in a sequence may be exploited to compress that sequence, we give a few more examples:
  • The Number π: Evidently, there exists a computer program for generating the first n digits of π – such a program could be based, for example, on an infinite series expansion of π. This computer program has constant size, except for the specification of n which takes no more than O(log n) bits. Thus, when n is very large, the size of the program generating the first n digits of π will be very small compared to n: the π-digit sequence is deterministic, and therefore extremely regular.
  • Physics Data: Consider a two-column table where the first column contains numbers representing various heights from which an object was dropped. The second column contains the corresponding times it took for the object to reach the ground. Assume both heights and times are recorded to some finite precision. In Section 1.3 we illustrate that such a table can be substantially compressed by first describing the coefficients of the second-degree polynomial H that expresses Newton’s law; then describing the heights; and then describing the deviation of the time points from the numbers predicted by H.
  • Natural Language: Most sequences of words are not valid sentences according to the English language. This fact can be exploited to substantially compress English text, as long as it is syntactically mostly correct: by first describing a grammar for English, and then describing an English text D with the help of that grammar [Grünwald 1996], D can be described using far fewer bits than are needed without the assumption that word order is constrained.

These three examples again exemplify the claim that compression is a generic task that can be performed in the same way no matter what data you are looking at.

To formalize our ideas, we need to decide on a description method, that is, a formal language in which to express properties of the data.

If we are to express regularities in the data, we need to have a language for writing regularities down!

The most general choice is a general-purpose computer language such as C or Pascal.

‘Most general’ in the sense that a general-purpose programming language can describe any computable way of defining a list of data. In full generality, we simply need a precise and unambiguous way of defining a list of data, whether computable or not. Natural language is not adequate, as it is ambiguous. Choosing a general-purpose (i.e. universal) programming language is useful, as we understand such things very well at a formal level.

This choice leads to the definition of the Kolmogorov Complexity [Li and Vitányi 1997] of a sequence as the length of the shortest program that prints the sequence and then halts. The lower the Kolmogorov complexity of a sequence, the more regular it is.

Observe that the Kolmogorov complexity of any sequence (for any reasonable programming language) will be at most roughly the length of the sequence itself: a variant of the following program is expressible in any reasonable language, and it is only a constant number of characters longer than the sequence (the characters of ‘printf’, the quotes, the two occurrences of the variable name, the ‘=’ character, and the newlines):

data=...
printf "$data"

This notion seems to be highly dependent on the particular computer language used. However, it turns out that for every two general-purpose programming languages A and B and every data sequence D, the length of the shortest program for D written in language A and the length of the shortest program for D written in language B differ by no more than a constant c, which does not depend on the length of D. This so-called invariance theorem says that, as long as the sequence D is long enough, it is not essential which computer language one chooses, as long as it is general-purpose.

Here is one way of understanding this result (which may remove some of the mystery). The general-purpose language A can be used to write a translator (a compiler) t from B into A. This translating program has some fixed size, say c. Then take a program p written in B that generates the data D. The program t(p) (the translator applied to the program p) is now a program written in A that generates D, and its size is the size of p, plus c, plus a small amount of glue (the space and the two parentheses).

Kolmogorov complexity was introduced, and the invariance theorem was proved, independently by Kolmogorov [1965], Chaitin [1969] and Solomonoff [1964a,1964b]. Solomonoff’s paper, called A Theory of Inductive Inference, contained the idea that the ultimate model for a sequence of data may be identified with the shortest program that prints the data. Solomonoff’s ideas were later extended by several authors, leading to an ‘idealized’ version of MDL [Solomonoff 1978; Li and Vitányi 1997; Gács, Tromp, and Vitányi 2001]. This idealized MDL is very general in scope, but not practically applicable, for the following two reasons:

uncomputability
It can be shown that there exists no computer program that, for every set of data D, when given D as input, returns the shortest program that prints D [Li and Vitányi 1997].

The uncomputability of Kolmogorov complexity is in keeping with the fact that finding minimal descriptions is undecidable even for very constrained classes of objects. For example, there is no algorithm that constructs a minimal context-free grammar for a given context-free language. Moreover, even if someone were to give us a context-free grammar, there is no general way to verify that it is a smallest one for its language!

arbitrariness/dependence on syntax
In practice we are confronted with small data samples for which the invariance theorem does not say much. Then the hypothesis chosen by idealized MDL may depend on arbitrary details of the syntax of the programming language under consideration.

The compiler-based intuition I sketched above only matters ‘in the limit’. A malicious programming language developer might decide to implement special purpose features of her language to make printing a particular data set a one-liner. Then while it is true that we can have a constant c which bounds the difference in size between programs in our two languages, this constant might be HUGE.

Like most authors in the field, we concentrate here on non-idealized, practical versions of MDL that explicitly deal with the two problems mentioned above. The basic idea is to scale down Solomonoff’s approach so that it does become applicable. This is achieved by using description methods that are less expressive than general-purpose computer languages. Such description methods C should be restrictive enough so that for any data sequence D, we can always compute the length of the shortest description of D that is attainable using method C; but they should be general enough to allow us to compress many of the intuitively ‘regular’ sequences. The price we pay is that, using the ‘practical’ MDL Principle, there will always be some regular sequences which we will not be able to compress. But we already know that there can be no method for inductive inference at all which will always give us all the regularity there is — simply because there can be no automated method which for any sequence D finds the shortest computer program that prints D and then halts. Moreover, it will often be possible to guide a suitable choice of C by a priori knowledge we have about our problem domain. For example, below we consider a description method C that is based on the class of all polynomials, such that with the help of C we can compress all data sets which can meaningfully be seen as points on some polynomial.
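
As a toy instance of such a restricted description method C, consider a two-part code for binary sequences: first state the number of 1s, k (about log2(n+1) bits), then give the sequence’s index among all strings with exactly k 1s (log2 C(n,k) bits). Unlike Kolmogorov complexity this is trivially computable, but it captures only frequency regularities. For a sequence like (3), with n = 10000 and a hypothetical count of k = 2000 ones, the saving over a literal n-bit description is substantial:

```shell
# Two-part code length for an n-bit string with k ones:
#   log2(n+1) bits to state k, plus log2 C(n,k) bits to pick the string
#   out of all strings with exactly k ones. Compare with n literal bits.
awk -v n=10000 -v k=2000 'BEGIN {
  lg = 0
  for (i = 1; i <= k; i++)              # log2 C(n,k) as a product of ratios
    lg += log((n - k + i) / i) / log(2)
  total = log(n + 1) / log(2) + lg
  printf "literal: %d bits, two-part code: about %.0f bits\n", n, total
}'
```

The result is roughly 0.72n bits, matching the αn compression with 0 < α < 1 claimed for sequence (3) above.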

Even more practical MDL methods give up on the requirement that our description methods should allow for computable minimization.

For next time

  • read the rest of chapter 1