Baum-Welch algorithm for training a Hidden Markov Model — Part 2 of the HMM series

How to train a Hidden Markov Model and use it for filtering and smoothing?

Part 1: Architecture of the Hidden Markov Model

Part 2: Algorithm to train an HMM: the Baum-Welch algorithm

Part 3: Algorithm to predict with a trained HMM: Viterbi algorithm

In the previous article, I talked about the architecture and parametrization of the Hidden Markov Model (HMM), and the meaning of the variables that I will use here. In this article, we will talk about the algorithm for training an HMM, before making use of it for prediction.

Baum-Welch algorithm

Also known as the forward-backward algorithm, the Baum-Welch algorithm is a dynamic programming approach and a special case of the expectation-maximization (EM) algorithm. Its purpose is to tune the parameters of the HMM, namely the state transition matrix A, the emission matrix B, and the initial state distribution π₀, so that the model maximizes the likelihood of the observed data.

There are a few phases in this algorithm: the initial phase, the forward phase, the backward phase, and the update phase. The forward and backward phases form the E-step of the EM algorithm, while the update phase is the M-step.

Initial phase

In the initial phase, the parameter matrices A, B, and π₀ are initialized; this can be done randomly if there is no prior knowledge about them.
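As a sketch, a random initialization in NumPy might look like this (the state and symbol counts here are hypothetical; each row of A and B, and π₀ itself, must sum to 1):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_symbols = 3, 4  # hypothetical sizes for illustration

# Each row of A and B, and pi0 itself, is a probability distribution.
A = rng.random((n_states, n_states))
A /= A.sum(axis=1, keepdims=True)      # state transition matrix
B = rng.random((n_states, n_symbols))
B /= B.sum(axis=1, keepdims=True)      # emission matrix
pi0 = rng.random(n_states)
pi0 /= pi0.sum()                       # initial state distribution
```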

Forward phase

In the forward phase, the following recursive alpha function is calculated. For the derivation of the function, I would strongly recommend this YouTube video, as the speaker presents it clearly and explains it very well.

There are a few points to make here:

It should be pointed out that each alpha contains the information from the observed data up to time k, and to get the next alpha, we only need to reuse the current alpha and add information about the transition to the next state and the next observed variable. This recursive behavior saves computation by freeing us from scanning through all of the past observed data every time.

By the way, we need the following starting alpha to begin the recursion.

The starting alpha is the product of the emission probabilities and the initial state distribution.
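The forward phase can be sketched as follows, assuming discrete observations encoded as integer indices into the columns of B:

```python
import numpy as np

def forward(obs, A, B, pi0):
    """Return alpha, where alpha[k, i] = P(observations up to time k, state at time k = i).

    Starting alpha: pi0 * emission probabilities of the first observation.
    Recursion: alpha_{k+1} = (alpha_k @ A) * B[:, o_{k+1}].
    """
    T, n = len(obs), A.shape[0]
    alpha = np.zeros((T, n))
    alpha[0] = pi0 * B[:, obs[0]]              # the starting alpha
    for k in range(1, T):
        alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]
    return alpha
```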

Backward phase

Please refer to this YouTube video for the derivation of the following formula.

Similar points can be made here:

Again, we need the ending beta to start the recursion.
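A matching sketch of the backward phase, under the same assumptions as before:

```python
import numpy as np

def backward(obs, A, B):
    """Return beta, where beta[k, i] = P(observations after time k | state at time k = i).

    Ending beta: all ones, since nothing is left to observe after the last step.
    Recursion: beta_k = A @ (B[:, o_{k+1}] * beta_{k+1}).
    """
    T, n = len(obs), A.shape[0]
    beta = np.ones((T, n))                     # the ending beta
    for k in range(T - 2, -1, -1):
        beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])
    return beta
```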

Hold on! Why the alpha and the beta functions?

Good question!

Firstly, as mentioned, they are both recursive functions, which means that we can reuse the previous answer as the input for the next one. This is what dynamic programming is about: you save time by reusing old results!

Secondly, the formula in the forward phase is very useful. Suppose you have a set of well-trained transition and emission parameters, and your problem is to find out, in real time, the mysterious hidden truth from observed data. Then you could actually do it like this: when you get one data point (data point p), you put it into the formula, which gives you the probability distribution of the associated hidden state, from which you pick the most probable one as your answer. The story does not stop there: when you get the next data point (data point q) and put it into the formula again, it gives you another probability distribution from which to pick the best choice, and this one is based not only on data point q and the transition and emission parameters, but also on data point p. Such use of the formula is called filtering.
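A minimal sketch of filtering with the alpha recursion; the parameter values below are made up for illustration:

```python
import numpy as np

# Hypothetical well-trained parameters for a 2-state, 2-symbol HMM.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi0 = np.array([0.5, 0.5])

# Data point p arrives: start the alpha recursion with it.
alpha = pi0 * B[:, 0]
belief = alpha / alpha.sum()      # filtered state distribution so far
state = belief.argmax()           # most probable hidden state right now

# Data point q arrives: one recursion step folds it in, and the new
# belief is based on both p and q, not on q alone.
alpha = (alpha @ A) * B[:, 1]
belief = alpha / alpha.sum()
```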

Thirdly, and continuing the above discussion, suppose you have already collected many data points. You know that the earlier a data point is, the less observed data your answer at that time was based on. You would therefore like to improve those earlier answers by somehow 'injecting' information from the later data into them. This is where the backward formula comes into play. Such use of the formula is called smoothing.

Fourthly, this combines the last two paragraphs. With the help of the alpha and beta formulas, one can determine the probability distribution of the state variable at any time k given the whole sequence of observed data. This can also be understood mathematically.

The denominator term is a normalization constant and is usually dropped like this, because it does not depend on the state, and therefore it does not matter when comparing the probabilities of different states at any time k.
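Putting the two passes together, a sketch of smoothing: the posterior at each time is proportional to alpha times beta, and normalizing each row makes the constant denominator drop out.

```python
import numpy as np

def smooth(obs, A, B, pi0):
    """Return gamma, where gamma[k, i] = P(state at time k = i | all observations)."""
    T, n = len(obs), A.shape[0]
    alpha = np.zeros((T, n))
    beta = np.ones((T, n))
    alpha[0] = pi0 * B[:, obs[0]]
    for k in range(1, T):
        alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]
    for k in range(T - 2, -1, -1):
        beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])
    gamma = alpha * beta
    # Normalizing each row makes the denominator (a constant) drop out.
    return gamma / gamma.sum(axis=1, keepdims=True)
```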

Lastly, the results from the alpha and beta functions are useful in the update phase.

Update phase

For the derivation of the above formulas: if you have watched the YouTube videos that I suggested for the forward and backward formulas, and you understand them, then you will probably have no problem deriving these two yourself.

The first formula here just repeats what we have seen above; to recap, it tells us the probability distribution of a state at time k given all the observed data we have. The second formula, however, tells us something slightly different: the joint probability of two consecutive states given the data. Both make use of the alpha function, the beta function, and the transition and emission parameters that are already available. These two formulas are then used to finally do the update.
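A sketch of one such update: gamma and xi are computed from the alpha and beta passes (vectorized here with NumPy broadcasting), and the parameters are then re-estimated from them.

```python
import numpy as np

def baum_welch_update(obs, A, B, pi0):
    """One Baum-Welch update: compute gamma and xi, then re-estimate pi0, A, B."""
    obs = np.asarray(obs)
    T, n = len(obs), A.shape[0]
    alpha = np.zeros((T, n))
    beta = np.ones((T, n))
    alpha[0] = pi0 * B[:, obs[0]]
    for k in range(1, T):
        alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]
    for k in range(T - 2, -1, -1):
        beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])
    likelihood = alpha[-1].sum()

    # gamma[k, i]  = P(state_k = i | observations)
    gamma = alpha * beta / likelihood
    # xi[k, i, j]  = P(state_k = i, state_{k+1} = j | observations)
    xi = (alpha[:-1, :, None] * A
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood

    new_pi0 = gamma[0]                                 # expected initial occupancy
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack([gamma[obs == s].sum(axis=0)      # expected emission counts
                      for s in range(B.shape[1])], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi0, new_A, new_B
```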

The derivation steps are not shown here, because mathematics is not the focus of this article, but showing the formulas themselves is useful for seeing how they are reused through the steps.

It was mentioned that the Baum-Welch algorithm is a special case of the EM algorithm. Here I will explain why very briefly. The alpha and beta functions form the E-step because they predict the expected hidden states given the observed data and the parameter matrices A, B, π₀. The update phase is the M-step because the last three update formulas are derived so that the left-hand-side parameters best fit the expected hidden states given the observed data.

Summary

The Baum-Welch algorithm is a case of the EM algorithm in which, in the E-step, the forward and backward formulas tell us the expected hidden states given the observed data and the parameter matrices as tuned so far. The M-step update formulas then tune the parameter matrices to best fit the observed data and the expected hidden states. These two steps are iterated over and over until the parameters converge, or until the model reaches some accuracy requirement.
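The whole procedure can be sketched as a loop that repeats the E- and M-steps until the log-likelihood stops improving. (Note that a production implementation would work in log space or with per-step scaling, since the raw alphas and betas underflow on long sequences.)

```python
import numpy as np

def train_hmm(obs, A, B, pi0, tol=1e-6, max_iter=200):
    """Iterate E- and M-steps until the log-likelihood converges."""
    obs = np.asarray(obs)
    T, n = len(obs), A.shape[0]
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: forward and backward passes.
        alpha = np.zeros((T, n))
        beta = np.ones((T, n))
        alpha[0] = pi0 * B[:, obs[0]]
        for k in range(1, T):
            alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]
        for k in range(T - 2, -1, -1):
            beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])
        like = alpha[-1].sum()
        gamma = alpha * beta / like
        xi = (alpha[:-1, :, None] * A
              * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / like
        # M-step: re-estimate the parameters from gamma and xi.
        pi0 = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[obs == s].sum(axis=0)
                      for s in range(B.shape[1])], axis=1)
        B /= gamma.sum(axis=0)[:, None]
        # Stop when the log-likelihood has converged.
        ll = np.log(like)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return A, B, pi0
```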

Like any machine learning algorithm, this one can overfit the data, since by definition the M-step pushes the model to fit the observed data as closely as possible. Also, although we have not said much about the initial phase, it does affect the final performance of the model (it can trap the model in a local optimum), so one might want to try different ways of initializing the parameters and see what works best.