Introduction
Hidden Markov Models (HMMs) are a widely used tool in solving all sorts of problems from natural language processing to signal reconstruction. Here I give an overview of the fundamental concepts involved. Let’s go!
Markov Processes
HMMs make the assumption that whatever is going on is following a Markov process, so let’s start with that. This isn’t complicated at all: it just means that what happens at time $ t $ only depends on what happened at time $ t-1 $, and not before. This is a pretty strong assumption, but it greatly simplifies analysis, and works pretty well in practice. Regardless, it’s a good starting point before we explore richer models. An example would be avoiding the same meal over consecutive days, but not caring about the further past. Slightly formally,
$$ P(X_t = x_t \mid X_{t-1}=x_{t-1}, X_{t-2}=x_{t-2}, …, X_0=x_0) = P(X_t = x_t \mid X_{t-1}=x_{t-1}) $$
So we have a bunch of states, with transition probabilities between them. We can represent this compactly as a transition diagram. Remember that the indices here are for different states, not different time periods (make sure you understand this):
The transition probabilities can of course be encoded in a matrix $ A $, where entry $ a_{ij} $ is the probability of going from state $ x_i $ to state $ x_j $ from any period $ t $ to $ t+1 $. Again, don’t let the above compact representation fool you. In reality, the unrolled graph over time looks like this:
And voilĂ , we have a Markov process!
HMMs
Hidden Markov Models assume there is an underlying Markov Process going on. The word hidden then comes from the fact that we do not observe the actual process, but rather a proxy for it. One can imagine all sorts of scenarios that fit this model: receiving a message that might be distorted, choosing a meal based on the previous day and then choosing a drink based on that…
The states in say set $ \mathbf{X} $ are now hidden from us, and we observe related states in say $ \mathbf{Y} $. Let’s look at the compact transition diagram for this, and I’ll leave the unrolled diagram to your imagination:
Note that we may have different numbers of states in the two sets. Remember that it is the states in $ \mathbf{X} $ that follow the Markov process. We just now assume that based on which of these we are in at time $ t $, there is a chance of seeing some $ y_i $. And where to encode these probabilies? You got it: a matrix $ B $ where entry $ b_{ij} $ is the probability of seeing state $ y_j $ given we are in fact in $ x_i $. Think about the dimensions of $A$ and $B$.
Question: can you tell me what $AB$ represents?
The Problem
Let’s be a bit more concrete about what we’re trying to solve. We assume a Markov process is going on, let’s say for times $ t=0…T $. We care about finding the real states $ x*{t=0…T} $, but we only observe $ y*{t=0…T} $. A good approach would be to find what states in $ \mathbf{X} $ are most likely given our observed states in $ \mathbf{Y} $.
Let $ x* = (x_0, x_1, …, x_T) $ be a vector of our solutions, and $y$ be the vector of observations. Mathematically, we want:
$$ \begin{eqnarray} x* &=& \underset{x} {\mathrm{argmax}} ~P(y) \ &=& \underset{x} {\mathrm{argmax}} ~\sum_{x} P(y \mid x) \cdot P(x) \end{eqnarray} $$
We simply marginalised $P(y)$. Notice that we have the information for $ P(y \mid x) $ in our transition matrix $ B $. The tricky part is figuring out the joint probabilities for all the possibilities of $x$. Let’s look at the two main approaches to solving for this.
Some Approaches
Quick Step Back
Before we dive in, let’s make sure we understand what the “hard” part is. We’re looking to maximise over all the possible joint probabilities of $x$. This get exponentially harder with each time step. If $ |\mathbf{X}| = m $, at first we’re choosing between $m$ values. For two time steps, we’re looking at $m^2$ possible paths, and so on up to $m^T$ for $T$ time steps.
This problem is begging for some dynamic programming, where we remember paths taken to avoid total recalculation at each time step, kind of like Hansel and Gretel.
The Forward Algorithm
Let’s say we just want to estimate $x_t$, for one time step, and we have observations from previous time steps. In other words we want $P(x_t, y_{t=0:t})$. Let’s call it $\alpha_t(x_t)$. This can be marginalised over all possibilities for $x_{t-1}$:
$$ \alpha_t(x_t) = P(x_t, y*{t=0:t}) = \sum*{x*{t-1}} P(x*t, x*{t-1}, y_{t=0:t}) $$
We’re looking to get something recursive. We can use the chain rule on the above:
$$ \alpha_t(x_t) = \sum_{x_{t-1}} P(y_t \mid x_t, x_{t-1}, y_{t=0:t-1}) P(x_t \mid x_{t-1}, y_{t=0:t-1}) P(x_{t-1}, y_{t=0:t-1}) $$
The first term in the sum can be simplified since our transition matrix $B$ encodes $y_t$ given $x_t$, regardless of other information:
$$ \begin{eqnarray} \alpha_t(x_t) &=& P(y_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1}, y_{t=0:t-1}) P(x_{t-1}, y_{t=0:t-1}) \ &=& P(y_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1}) \alpha_{t-1}(x_{t-1}) \end{eqnarray} $$
We have our recursion! This means that at each time step, we can do a quick calculation based on our transition probabilities in $A$ and $B$. The algorithm is no longer exponential in the number of time steps. Now that we have a feel for how to attack this problem, we can look at the closely related Viterbi algorithm to solve for overall maximum likelihood!
The Viterbi Algorithm
Remember that we want to find the most likely $ x^* = (x_0, x*1, …, x*T) $ based on our observations $y_{t=0…T}$. What we’ll do at each time step is remember the highest likelihood path. That way at each step $t$ to $t+1$ we’re only calculating the most likely extra step required.
One thing we haven’t mentionned that we’ll need is an inital distribution for the states at time 0. Call this $\pi \in \mathbb{R}^m$. The way we recall likelihood is through a “value function” for each time step:
$$V_{t} = (V_{t,1}, …, V_{t,m})$$
That is the total “value” from getting to each state at time $t$. We then define the recursion:
$$ \begin{eqnarray} V_{1,k} &=& P(y_1 \mid k) \cdot \pi_k \ V_{t,k} &=& P(y_t \mid k) \cdot max_{x \in \mathbf{X}}( a_{x,k} \cdot V_{t-1, x} ) \end{eqnarray} $$
Recall there $a_{ij}$ just comes from our transition matrix $A$. It may take a second to get your head around this, but just remember we’re taking the highest likelihood path at each step. By saving the $x$ value we chose at each step, we have our sequence $x^*$.
Question: Can you figure out the big-O complexity of this algorithm in terms of $T$ and $|\mathbf{X}|$?
Conclusion & Extensions
Hidden Markov Models are extremely useful in practice. By making a few assumptions, they can provide a good balance of informational richness and tractability. HMMs are also a great stepping stone to more generalised probabilistic graphical models!
Now that we’ve explored the basics, can you think of some extensions to our model? Here are some that initially come to my mind:
- Consider different transition matrices $A_t$ and $B_t$ for each time step
- Allow for multiple hidden layers
- Solve using Monte Carlo methods
- Extend the Markov assumption to 2, 3, or more time steps back!
I’m sure there are many more. Hope you had fun, and thanks for reading!