Grünwald 2004 (part 4)

Statistical Preliminaries

probabilistic model
Remember that a model is a set of hypotheses; a model is the wiggle room that we have for defining a particular theory. So a probabilistic model is a set of probabilistic hypotheses. If we think of a theory as a way of generating predictions, a probabilistic hypothesis/theory is a probabilistic source of predictions.

Often, the wiggle room that a model provides is structured in a particular way. As an example, you could think that all languages share the same set of lexical feature bundles, but can differ with regard to whether each feature is strong or weak. Then the model of minimalism would be a set of possible lexical feature bundles, and a particular hypothesis (for a particular language) would require us to specify which features on which feature bundles are strong; we can call this a parameter setting. The parameter space would be the set of all possible parameter settings.

If we have a probabilistic model, then our hypotheses can be viewed as probability distributions - often each hypothesis can be viewed as a fixing of some parameters left open by the model. It is typical to view parameters as a sequence of real numbers, e.g. \(\theta = \langle 0, -8, 415.117, -\pi, \frac{32}{17}\rangle\). In principle, the parameters can be related in arbitrary ways to the hypotheses; we could have a hypothesis space (a model) that used parameters to define point hypotheses as follows: if the first parameter is a prime number, then do XXX, otherwise do YYY, etc. This is ‘unnatural’ - smoothly changing the parameters leads to large and erratic ‘jumps’ in the described point hypotheses. When the parameters of the model are related in regular ways to its point hypotheses, the model is called parametric.

maximum likelihood
Given some data \(D = x^{n}\), each (probabilistic) point hypothesis assigns to \(D\) its probability of occurring, its likelihood, as it were. A point hypothesis which assigns to the data the highest probability (likelihood) is called a maximum likelihood hypothesis. If we have a parametric model, then each point hypothesis is determined by some parameter setting \(\theta\). A function \(\hat{\theta}\) which assigns to any data \(D\) a parameter setting which maximizes the data’s likelihood is a maximum likelihood estimator. This ‘estimator’ assigns to every possible data a hypothesis which best accounts for it in the sense of making it as likely as possible relative to the hypotheses available given the model. If the estimator has the additional property that, as we see more and more data, we converge to the true hypothesis (if one exists), the estimator is said to be consistent.
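As an illustrative sketch (my own, not from the text): a brute-force maximum likelihood estimator for a coin-flip model, which searches a finite grid of candidate parameter settings for the one maximizing the likelihood of the observed data.

```python
# A toy maximum likelihood estimator. Each hypothesis is a probability
# theta of observing a 1; we search a finite grid of candidate
# parameter settings for the one that maximizes the likelihood of the
# data. All names here are illustrative.

def likelihood(theta, data):
    """Probability the hypothesis theta assigns to the binary sequence."""
    p = 1.0
    for x in data:
        p *= theta if x == 1 else 1 - theta
    return p

def ml_estimate(data, grid):
    """Return the parameter setting in `grid` maximizing the likelihood."""
    return max(grid, key=lambda theta: likelihood(theta, data))

data = [1, 1, 0, 1]                      # three 1s, one 0
grid = [i / 100 for i in range(1, 100)]  # candidate hypotheses
print(ml_estimate(data, grid))           # 0.75: the relative frequency of 1s
```

The grid search is of course wasteful; for this model the maximum can be written down in closed form, as we see below with the Bernoulli model.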
Markov chain
A Markov chain is a way of assigning probabilities to a sequence of observations. Here we will restrict ourselves to a binary alphabet, although the model is defined more generally. A sequence \(wa\) of observations is in general assigned a probability of the form \(P(a| w)\) (pronounced: the probability of a given w). This represents the idea that nature is producing observables one at a time, based on its current state, and its state is dependent on everything that has occurred before. The fundamental assumption behind the Markov chain model is that the state of nature at any given time is completely determined by the previous \(k\) observations, and nothing else. We can think of a Markov chain as being given by a family of probability distributions over single observations: \(P(\cdot | w)\), for every length-\(k\) sequence of possible observations \(w\). This amounts in essence to saying: the probability of seeing a particular sequence of data points is the product of the probabilities of seeing each one of them, dependent on the previous \(k\).

If the Markov chain is of order 1, a string \(00010\) will be assigned the probability: \[P(0|\lhd)\times P(0|0) \times P(0|0) \times P(1|0) \times P(0|1)\times P(\rhd|0)\] (Here \(\lhd\) and \(\rhd\) are special string boundary symbols; \(P(\cdot| \lhd)\) is the probability of being the first symbol of a string, and \(P(\rhd | \cdot)\) the probability of being the last symbol of a string.)
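To make the computation concrete, here is a small Python sketch. The transition table is a made-up hypothesis, chosen only to make the numbers concrete, and `<` and `>` stand in for the boundary symbols \(\lhd\) and \(\rhd\).

```python
# Probability of a string under an order-1 Markov chain with explicit
# string-boundary symbols. The probabilities below are an arbitrary
# illustrative hypothesis, not estimated from anything.

P = {  # P[context][next]: probability of `next` given the previous symbol
    "<": {"0": 0.5, "1": 0.5},            # "<" = start-of-string
    "0": {"0": 0.6, "1": 0.3, ">": 0.1},  # ">" = end-of-string
    "1": {"0": 0.7, "1": 0.2, ">": 0.1},
}

def chain_probability(s):
    """P(s[0]|<) * P(s[1]|s[0]) * ... * P(>|s[-1]) for an order-1 chain."""
    prob = P["<"][s[0]]
    for prev, nxt in zip(s, s[1:]):
        prob *= P[prev][nxt]
    return prob * P[s[-1]][">"]

print(chain_probability("00010"))  # 0.5 * 0.6 * 0.6 * 0.3 * 0.7 * 0.1
```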

Bernoulli model
A Bernoulli model is the special case of a Markov chain where \(k = 0\). That is, the probability of a particular observation at a particular point in time is independent of anything that happened before - there is no ‘memory’ for past events. In our binary alphabet setting, we can think of a Bernoulli model as a coin-flip model. The only relevant parameter is the probability \(\theta\) of obtaining a 1 (because the probability of obtaining a 0 is just \(1-\theta\)). The probability of a sequence \(00010\) will be the product \(P(0)^{4}\times P(1)^{1}\). As the probabilities \(P(1)\) and \(P(0)\) are related to one another, this is equal to \(\theta(1-\theta)^{4}\). In general, for any sequence of observations with \(m\) ones and \(n\) zeros, its probability will be \(\theta^{m}(1-\theta)^{n}\). An estimator will take a sequence of data, and come up with a parameterization for it. In the case of a Bernoulli model, the maximum likelihood estimator for any sequence of observations chooses the probability of 1s to be the relative frequency of 1s in the data: \(\hat{\theta}(w) = \frac{\left| w\right|_{1}}{\left| w\right|}\). (Here \(\left| w\right|_{a}\) is the number of \(a\)s in \(w\).)
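In code, the closed-form Bernoulli estimator and the likelihood it maximizes look like this (boundary symbols are ignored here for simplicity):

```python
# Maximum likelihood estimation for the Bernoulli model: the estimated
# probability of a 1 is just the relative frequency of 1s in the data.

def bernoulli_mle(w):
    """theta-hat(w) = |w|_1 / |w|."""
    return w.count("1") / len(w)

def bernoulli_probability(theta, w):
    """theta^m * (1 - theta)^n for m ones and n zeros in w."""
    return theta ** w.count("1") * (1 - theta) ** w.count("0")

w = "00010"
theta = bernoulli_mle(w)               # 0.2: one 1 out of five observations
print(bernoulli_probability(theta, w)) # = 0.2 * 0.8^4, approximately 0.08192
```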
Estimating markov chains
The idea behind the maximum likelihood estimator for the Bernoulli model can be generalized to the case of Markov chains of arbitrary order. The Markov chain needs to have one parameter \(\theta_{w}\) for each sequence \(w\) of preceding observations of length at most \(k\). We can estimate each parameter \(\theta_{w} = P(1| w)\) by setting \(\hat{\theta}_{w}(u) = \frac{\left| u\right|_{w1}}{\left| u \right|_{w}}\).
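A sketch of this count-based estimate in Python, following the formula above (note that occurrences must be counted with overlaps, which rules out the built-in `str.count`):

```python
# Count-based maximum likelihood estimate of a Markov chain parameter:
# theta-hat_w(u) = |u|_{w1} / |u|_w, i.e. the number of occurrences of
# w followed by a 1, divided by the number of occurrences of w.

def count(u, w):
    """Number of (possibly overlapping) occurrences of w in u."""
    return sum(1 for i in range(len(u) - len(w) + 1) if u[i:i + len(w)] == w)

def estimate(u, w):
    """theta-hat_w(u) = |u|_{w1} / |u|_w."""
    return count(u, w + "1") / count(u, w)

u = "0100110"
print(estimate(u, "0"))  # "01" occurs 2 times, "0" occurs 4 times: 0.5
```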

This is all sort of magical: we have a super simple procedure to choose a best Markov chain of a particular order which accounts for any sequence of observations.

The difficulty comes in when we try to generalize this to a procedure which generates a Markov chain of any order which accounts well for any sequence of observations. Note that if we can choose the order of the Markov chain at our discretion, the unique Markov chain which accounts best for a particular series of observations \(w\) is of order \(\left| w\right|\) and simply asserts that this is the only observation possible. On the other hand, the Markov chain of order 0 simply views each observation as an independent coin toss, which does not allow for any interesting dependencies between the observations. We can think of this as parallel to learning a grammar; here the observations are words making up a sentence (so we have just one sentence that we are exposed to). The first extreme is to take this sentence as a ‘lexical item’. We account perfectly for the sentence, but do no generalization. The second extreme is to take each word of the sentence as a lexical item, and allow free merge. Here we can describe the sentence, but also everything else - we generalize too much.
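The trade-off can be illustrated numerically. The sketch below is my own construction, not Grünwald’s: boundary symbols are ignored, and early symbols are conditioned on whatever shorter context is available. It estimates chains of increasing order from a string and computes the probability each fitted chain assigns to that same string; the fit to the training data improves as the order grows.

```python
# Illustrating over-fitting: estimate a Markov chain of order k from a
# string u, then compute the probability the fitted chain assigns to u
# itself. Higher orders fit the training data ever better.

def count(u, w):
    """Number of (possibly overlapping) occurrences of w in u."""
    return sum(1 for i in range(len(u) - len(w) + 1) if u[i:i + len(w)] == w)

def fitted_likelihood(u, k):
    """Likelihood of u under the order-k chain estimated from u itself."""
    prob = 1.0
    for i, x in enumerate(u):
        w = u[max(0, i - k):i]  # the (up to k) preceding observations
        # Occurrences of the context w that are followed by some symbol:
        followed = count(u, w) - (1 if u.endswith(w) else 0)
        prob *= count(u, w + x) / followed
    return prob

u = "00010"
print(fitted_likelihood(u, 0))           # the Bernoulli fit, about 0.082
print(fitted_likelihood(u, 1))           # higher: about 0.119
print(fitted_likelihood(u, len(u) - 1))  # higher still: about 0.267
```

Under this naive counting scheme the maximal-order chain does not quite reach probability 1 (the best such chain would have to be written down directly), but the direction of the trade-off is visible: more order, better fit to the one observed sequence, less generalization.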

Next time

we will see (in section 2.4) how MDL can navigate between the Charybdis of over-restrictiveness and the Scylla of over-generalization in the context of Markov chains.