In my work at DARPA, I’ve been exposed to hidden Markov models in applications as diverse as temporal pattern recognition such as speech, handwriting, gesture recognition, musical score following, and bioinformatics. My background is in stochastic modeling and optimization, and hidden Markov models are a fascinating intersection between my background and my more recent work with machine learning. Recently, I’ve come across a new twist on the Markov model: the Hierarchical Dirichlet Process Hidden Markov Model.

# What is a Markov model?

Say in DC, we have three types of weather: (1) sunny, (2) rainy and (3) foggy. Lets assume for the moment that the doesnt change from rainy to sunny in the middle of the day. Weather prediction is all about trying to guess what the weather will be like tomorrow based on a history of observations of weather. If we assume the days preceding today will give you a good weather prediction for today we need the probability for each state change:

$$ P(w_n | w_{n-1}, w_{n-2},\ldots, w_1) $$

So, if the last three days were sunny, sunny, foggy, we know that the probability that tomorrow would be rainy is given by:

$$ P(w_4 = \text{rainy}| w_3 = \text{foggy}, w_2 = \text{sunny}, w_1 = \text{sunny}) $$

This all works very well, but the state space grows very quickly. Just based on the above, we would need $3^4$ histories. So fix this we make the **Markov Assumption** that everything really depends on the previous state alone, or:

$$ P(w_n | w_{n-1}, w_{n-2},\ldots, w_1) \approx P(w_n| w_{n-1}) $$

which allows us to calculate the joint probability of weather in one day given we know the weather of the previous day:

$$ P(w_1, \ldots, w_n) = \prod_{i=1}^n P(w_i| w_{i-1})$$

and now we only have nine numbers to characterize statistically.

# What is a hidden Markov model?

In keeping with the example above, suppose you were locked in a room and asked about the weather outside and the only evidence you have is that that ceiling drips or not from the rain outside. We are still in the same world with the same assumptions and the probability of each state is still given by:

$$ P(w_1, \ldots, w_n) = \prod_{i=1}^n P(w_i| w_{i-1})$$

but we have to factor that the actual weather is *hidden* from you. We can do that using Bayes’ rule where $u_i$ is true if the ceiling drips on day $i$ and false otherwise:

$$P(w_1, \ldots, w_n)| u_1,\ldots,u_n)=\frac{P(u_1,\ldots,u_n | w_1, \ldots, w_n))}{P(u_1,\ldots,u_n)}$$

Here the probability $P(u_1,\ldots,u_n)$ is the prior probability of seeing a particular sequence of ceiling leak events ${True,False,True}$. With this, you can answer questions like:

Suppose the day you were locked in it was sunny. The next day the ceiling leaked. Assuming that the prior probability of the caretaker carrying an umbrella on any day is 0.5, what is the probability that the second day was rainy?

So if Markov models consider states that are directly visible to the observer, the state transition probabilities are the only parameters. By contrast, in hidden Markov models (HMMs) the state is not directly visible, but the output, dependent on the state, is visible. A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (or hidden) states. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states. In this context, ‘hidden’ refers to the state sequence through which the model passes, not to the parameters of the model; the model is still referred to as a ‘hidden’ Markov model even if these parameters are known exactly.

# OK, so what is a Hierarchical Dirichlet Process Hidden Semi-Markov Model?

Hidden Markov models are generative models where the joint distribution of observations and hidden states, or equivalently both the prior distribution of hidden states (the transition probabilities) and conditional distribution of observations given states (the emission probabilities) are modeled. Instead of implicitly assuming a uniform prior distribution over the transition probabilities, it is also possible to create hidden Markov models with other types of prior distributions. An obvious candidate, given the categorical distribution of the transition probabilities, is the Dirichlet distribution, which is the conjugate prior distribution of the categorical distribution.

In fact, it is possible to use a Dirichlet process in place of a Dirichlet distribution. This type of model allows for an unknown and potentially infinite number of states. It is common to use a two-level Dirichlet process, similar to the previously described model with two levels of Dirichlet distributions. Such a model is called a hierarchical Dirichlet process hidden Markov model, or HDP-HMM for short or it is also called the “Infinite Hidden Markov Model”.

The Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) is a natural Bayesian nonparametric extension of the traditional HMM. The single parameter of this distribution (termed the concentration parameter) controls the relative density or sparseness of the resulting transition matrix. By using the theory of Dirichlet processes it is possible to integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence.

This is really cool. If you formulate a HMMs with a countably infinite number of hidden states,

you would have infinitely many parameters in the state transition matrix. The key idea is that the theory of Dirichlet processes can implicitly integrate out all but the three parameters which define the prior over transition dynamics. It is also possible to use a two-level prior Dirichlet distribution, in which one Dirichlet distribution (the upper distribution) governs the parameters of another Dirichlet distribution (the lower distribution), which in turn governs the transition probabilities. The upper distribution governs the overall distribution of states, determining how likely each state is to occur; its concentration parameter determines the density or sparseness of states. Such a two-level prior distribution, where both concentration parameters are set to produce sparse distributions, might be useful for example in unsupervised part-of-speech tagging, where some parts of speech occur much more commonly than others; learning algorithms that assume a uniform prior distribution generally perform poorly on this task. The parameters of models of this sort, with non-uniform prior distributions, can be learned using Gibbs sampling or extended versions of the expectation-maximization algorithm.

# So how can we use this?

A common problem in speech recognition is segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. This problem is often called *speaker diarization*. This is particularly challenging since you don’t know the number of people participating in the meeting and modified HDP-HMMs have been very effective at achieving state-of-the-art speaker diarization results.

Other interesting applications of HDP-HMMs have been modeling otherwise intractable linear dynamical systems which describing dynamical phenomena as diverse as human motion, financial time-series, maneuvering targets, and the dance of honey bees. (See this paper for more details. Results have shown that HDP-HMM can identify periods of higher volatility in the daily returns on the IBOVESPA stock index (Sao Paulo Stock Exchange). Most interesting to me was the application to using HDP-HMMs on a set of six dancing honey bee sequences aiming to segment the sequences into distinct dances.

You can see some other cool motion capture examples here.

## Leave a Reply