Application of the EM Algorithm in Statistical Natural Language Processing

Abstract: In statistical natural language processing, a recurring problem is how to obtain the maximum-likelihood estimate of model parameters when the observed data are incomplete. The EM algorithm is the classical method for this problem. This study describes the basic framework of the EM algorithm and shows how to apply it to maximum-likelihood parameter estimation for two models, the Hidden Markov Model (HMM) and the probabilistic context-free grammar (PCFG). Finally, the advantages and disadvantages of the EM algorithm are discussed.


INTRODUCTION
With the appearance of large-scale machine-readable corpora and the rapid increase in computing speed and storage capacity, empiricism in natural language processing has seen a rapid revival. The introduction of statistical learning methods has greatly changed computational linguistics: by training on corpora, these methods acquire linguistic knowledge automatically or semi-automatically, which is of great significance for solving the "knowledge acquisition bottleneck" problem.
However, statistical natural language processing often runs into a parameter estimation problem: how to obtain the maximum likelihood estimate of the parameters when the observed data are incomplete. The EM (Expectation-Maximization) algorithm is the classic algorithm for this kind of problem; it was proposed by Dempster, Laird and Rubin in 1977 and is widely used for parameter estimation with incomplete data. The EM algorithm has two major applications (Bilmes, 1997): the first is parameter estimation when data are actually missing; the second is to assume the existence of additional missing (hidden) parameters, which may not really exist, in order to greatly simplify the likelihood function. In statistical natural language processing the latter application is more common. This study first gives the basic framework of the general EM algorithm, then shows how to use the EM algorithm for maximum likelihood parameter estimation in the Hidden Markov Model (HMM) and the probabilistic context-free grammar (PCFG), and finally gives the conclusions.

BASIC FRAMEWORK OF EM ALGORITHM
The basic framework presented here follows the literature (Bilmes, 1997; Michael, 1994, 1995). The basic idea of the EM algorithm is to split the problem into two steps: the E step, which computes the conditional expectation of the log-likelihood of the complete data set, and the M step, which maximizes that expectation. The E step and M step are iterated until a maximum point is reached. The formal description is as follows. Let the complete data set be $Z = (X, Y)$, where $X$ is the observed data and $Y$ is the missing (or hidden) data. Under the parameter set $\Theta$, the joint density of $X$ and $Y$ is $p(z|\Theta) = p(x, y|\Theta) = p(y|x, \Theta)\,p(x|\Theta)$, where $x \in X$, $y \in Y$. The likelihood function of the complete data set $Z$ is then $L(\Theta|Z) = L(\Theta|X, Y) = p(X, Y|\Theta)$.
Step 1: The first step of the EM algorithm (E step) takes the complete-data log-likelihood $\log p(X, Y|\Theta)$ and, given the observed data set $X$ and the current parameter estimate $\Theta^{(i-1)}$, computes its expectation with respect to the unknown data $Y$:
$$Q(\Theta, \Theta^{(i-1)}) = E\left[\log p(X, Y|\Theta) \mid X, \Theta^{(i-1)}\right]$$
Here $\Theta$ is the new parameter set to be optimized so that the value of $Q$ increases.
Step 2: The second step (M step) maximizes the expectation computed in the first step:
$$\Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)})$$
The two steps are iterated. Each iteration is guaranteed not to decrease the log-likelihood, and the likelihood function converges to a local maximum.
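The E step and M step above can be illustrated with a minimal sketch on a toy problem that is not taken from this study: each trial tosses one of two biased coins n times, the identity of the coin is the hidden data Y, and the head counts are the observed data X. The function names, the data and the initial values are assumptions chosen only for illustration.

```python
import math

def trial_likelihoods(h, n, pi, p):
    """Joint weight pi_k * p_k^h * (1 - p_k)^(n - h) for each coin k.

    The binomial coefficient is omitted because it does not depend on the parameters.
    """
    return [pi[k] * p[k] ** h * (1.0 - p[k]) ** (n - h) for k in range(len(pi))]

def log_likelihood(heads, n, pi, p):
    """Observed-data log-likelihood, up to an additive constant."""
    return sum(math.log(sum(trial_likelihoods(h, n, pi, p))) for h in heads)

def em_two_coins(heads, n, pi, p, iterations=20):
    """Run EM; return the final (pi, p) and the log-likelihood after each iteration."""
    history = [log_likelihood(heads, n, pi, p)]
    for _ in range(iterations):
        # E step: posterior probability of each coin given a trial's head count.
        resp = []
        for h in heads:
            joint = trial_likelihoods(h, n, pi, p)
            total = sum(joint)
            resp.append([w / total for w in joint])
        # M step: closed-form maximizers of the expected complete-data log-likelihood.
        mass = [sum(r[k] for r in resp) for k in range(len(pi))]
        pi = [mass[k] / len(heads) for k in range(len(pi))]
        p = [sum(r[k] * h for r, h in zip(resp, heads)) / (n * mass[k])
             for k in range(len(pi))]
        history.append(log_likelihood(heads, n, pi, p))
    return pi, p, history

# Hypothetical data: heads observed in each trial of n = 10 tosses.
heads = [9, 8, 2, 1, 7, 3, 9, 2]
pi, p, history = em_two_coins(heads, 10, [0.5, 0.5], [0.6, 0.4])
print(pi, p, history[-1])
```

The returned history makes the monotonicity property visible: the log-likelihood should never decrease from one iteration to the next.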

THE APPLICATION OF EM ALGORITHM IN STATISTICAL NATURAL LANGUAGE PROCESSING
The EM algorithm has a wide range of applications in statistical natural language processing, such as the forward-backward algorithm for HMMs, the inside-outside algorithm for PCFGs, EM clustering and unsupervised word sense disambiguation, all of which are specific applications of the EM algorithm to parameter estimation problems. The parameter estimation procedures of the forward-backward algorithm and the inside-outside algorithm are described in detail below.
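The parameter estimation problem of HMM: The HMM parameter estimation problem is, given an observation sequence, to estimate a set of HMM parameters $(A, B, \pi)$ that maximizes the probability of generating that sequence under the model. The forward-backward algorithm (also called the Baum-Welch algorithm) is the commonly used method, and it is equivalent to the EM algorithm. It adjusts the model parameters $\lambda = (A, B, \pi)$ toward a local maximum of $P(O|\lambda)$ through an iterative re-estimation of the parameters. To facilitate the description, forward and backward variables are defined. The forward variable is
$$\alpha_t(i) = P(O_1 O_2 \cdots O_t,\; q_t = s_i \mid \lambda)$$
which is the probability, given the model $\lambda$, that the observed sequence from moment 1 to moment $t$ is $(O_1, O_2, \ldots, O_t)$ and that the system is in state $s_i$ at moment $t$. The backward variable is
$$\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = s_i,\; \lambda)$$
which is the probability, given the model and given that the state at moment $t$ is $s_i$, of generating the observed sequence $(O_{t+1}, O_{t+2}, \ldots, O_T)$ from moment $t+1$ to moment $T$. Both variables can be computed recursively by the forward and backward procedures; the details can be found in the literature (Christopher, 1999).
The observed data of the HMM is the sequence $O = (O_1, O_2, \ldots, O_T)$ and the hidden data is the state sequence $q$. The likelihood function of the incomplete data set is $P(O|\lambda)$ and the likelihood function of the complete data set is $P(O, q|\lambda)$. The Q function is therefore defined as
$$Q(\lambda, \lambda') = \sum_{q \in \gamma} \log P(O, q|\lambda)\; P(O, q|\lambda')$$
where $\lambda'$ is the current parameter set, $\lambda$ is the new parameter set estimated from the current parameters and the observation sequence $O$, and $\gamma$ is the space of state sequences of length $T$. For a particular state sequence $q$, $P(O, q|\lambda)$ can be written as
$$P(O, q|\lambda) = \pi_{q_0} \prod_{t=1}^{T} a_{q_{t-1} q_t}\, b_{q_t}(O_t)$$
In this expression $\pi_{q_0}$ is the probability of the initial state, $a_{q_{t-1} q_t}$ is the probability of a transition from state $q_{t-1}$ to state $q_t$, and $b_{q_t}(O_t)$ is the probability of emitting symbol $O_t$ in state $q_t$. The Q function can therefore be rewritten as
$$Q(\lambda, \lambda') = \sum_{q \in \gamma} \log \pi_{q_0}\, P(O, q|\lambda') + \sum_{q \in \gamma} \Big( \sum_{t=1}^{T} \log a_{q_{t-1} q_t} \Big) P(O, q|\lambda') + \sum_{q \in \gamma} \Big( \sum_{t=1}^{T} \log b_{q_t}(O_t) \Big) P(O, q|\lambda')$$
Because the parameters to be optimized are separated into these three independent terms, each term can be optimized on its own. The Lagrange multiplier method is applied to each of the three terms to find the constrained extremum under the constraints
$$\sum_{i} \pi_i = 1, \qquad \sum_{j} a_{ij} = 1 \text{ for every } i, \qquad \sum_{k} b_i(k) = 1 \text{ for every } i$$
Finally the three groups of parameters are obtained:
$$\pi_i = \frac{P(O, q_0 = s_i \mid \lambda')}{P(O \mid \lambda')}$$
$$a_{ij} = \frac{\sum_{t=1}^{T} P(O, q_{t-1} = s_i, q_t = s_j \mid \lambda')}{\sum_{t=1}^{T} P(O, q_{t-1} = s_i \mid \lambda')}$$
$$b_i(k) = \frac{\sum_{t=1}^{T} P(O, q_t = s_i \mid \lambda')\, \delta_{O_t, v_k}}{\sum_{t=1}^{T} P(O, q_t = s_i \mid \lambda')}$$
where $v_k$ is the $k$-th output symbol and $\delta_{O_t, v_k}$ is 1 if $O_t = v_k$ and 0 otherwise. The probabilities in these three expressions can be obtained from the forward variable $\alpha_t(i)$ and the backward variable $\beta_t(i)$; for example, $P(O, q_t = s_i \mid \lambda') = \alpha_t(i)\,\beta_t(i)$ when both variables are computed under $\lambda'$.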
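The forward and backward variables can be computed with the following minimal sketch. It assumes the common convention in which $\pi$ is the state distribution at the first observation, time is indexed from 0 in the code, and observations are integer symbol indices; the function names and the example model are illustrative assumptions, not part of this study.

```python
def forward(obs, A, B, pi):
    """alpha[t][i] = P(O_1..O_t, q_t = s_i | lambda), with t 0-based in the code."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(O_{t+1}..O_T | q_t = s_i, lambda), with t 0-based in the code."""
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta

# Hypothetical two-state model with three output symbols.
A = [[0.7, 0.3], [0.4, 0.6]]             # transition probabilities a_ij
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]   # emission probabilities b_i(k)
pi = [0.6, 0.4]                          # initial state distribution
obs = [0, 1, 2, 0]

alpha = forward(obs, A, B, pi)
beta = backward(obs, A, B)
print(sum(alpha[-1]))                    # P(O | lambda)
```

A useful consistency check is that the sum over states of alpha[t][i] * beta[t][i] gives the same value P(O|λ) at every time step. For long sequences a practical implementation would add scaling or work in log space to avoid underflow.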
The EM algorithm starts from an initial model $\lambda = (A, B, \pi)$ and uses the re-estimation formulas above to obtain a new model that replaces the original one. Baum proved that, as the iteration proceeds, $P(O|\lambda)$ under the new model never decreases, and it keeps improving until a local maximum is reached. The Hidden Markov Model finally obtained is called the maximum likelihood model; it (locally) maximizes the probability of the observed sequence $O$.
Following its successful application in speech recognition (Rabiner, 1989), the Hidden Markov Model also achieved great success in part-of-speech tagging and made part-of-speech taggers practical. In the literature (Cutting et al., 1992), an HMM-based part-of-speech tagger was trained on half of the Brown corpus (500 000 words); after eight training iterations it was used to tag the other half of the Brown corpus and reached an accuracy of 96%, which is still a good result at present. Of course, the problems of the Hidden Markov Model cannot be ignored: data sparseness, which calls for large amounts of training data; a very large number of model parameters; linguistic knowledge learned from the corpus that is hard for humans to read; and the inability to capture long-distance linguistic information. To overcome these shortcomings, several variants of the Hidden Markov Model have been proposed, such as the variable-memory Hidden Markov Model and the hierarchical Hidden Markov Model, which overcome these shortcomings to some degree.
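The iterative re-estimation described above can be sketched as follows. This is a minimal illustration under stated assumptions: a single observation sequence of integer symbol indices, the convention that $\pi$ applies to the first observation, no scaling (so it is only suitable for short sequences), and the re-estimation formulas in their common textbook form. All function and variable names are hypothetical.

```python
def forward_backward(obs, A, B, pi):
    """Return the forward table, the backward table and P(O | lambda)."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    return alpha, beta, sum(alpha[T - 1])

def baum_welch_step(obs, A, B, pi):
    """One EM iteration; also returns P(O | lambda) under the old parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta, likelihood = forward_backward(obs, A, B, pi)
    # E step. state_post[t][i] = P(q_t = s_i | O, lambda)
    state_post = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
                  for t in range(T)]
    # trans_post[t][i][j] = P(q_t = s_i, q_{t+1} = s_j | O, lambda)
    trans_post = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
                    for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M step: ratios of expected counts.
    new_pi = list(state_post[0])
    new_A = [[sum(trans_post[t][i][j] for t in range(T - 1)) /
              sum(state_post[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(state_post[t][i] for t in range(T) if obs[t] == k) /
              sum(state_post[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_A, new_B, new_pi, likelihood

def baum_welch(obs, A, B, pi, iterations=10):
    """Iterate re-estimation; history[k] is P(O | lambda) before the k-th update."""
    history = []
    for _ in range(iterations):
        A, B, pi, likelihood = baum_welch_step(obs, A, B, pi)
        history.append(likelihood)
    return A, B, pi, history

# Hypothetical initial model and observation sequence.
A0 = [[0.6, 0.4], [0.3, 0.7]]
B0 = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
pi0 = [0.5, 0.5]
obs = [0, 1, 0, 2, 2, 1, 0, 0, 2, 1]
A, B, pi, history = baum_welch(obs, A0, B0, pi0)
print(history)
```

The values in history should be non-decreasing, which is the property attributed to Baum above.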
The parameter estimation problem in probabilistic context-free grammar (PCFG): A Probabilistic Context-Free Grammar (PCFG) is a Context-Free Grammar (CFG) in which each rule is assigned a probability that indicates how likely the different rewrite rules are. Through three assumptions (place invariance, context-freeness and ancestor-freeness), a PCFG not only inherits the context-free property of the CFG but also makes the rule probabilities context-free, so that the probability of each parse tree can be computed as the product of the probabilities of the rules used in it; the parse tree with the highest probability is taken as the most probable analysis.
The introduction of rule probabilities helps to resolve syntactic ambiguity and also makes sentence structure analysis more flexible.
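As a small illustration of how rule probabilities support disambiguation, the sketch below scores two competing parse trees of one sentence under a toy grammar. The grammar, its probabilities, the tuple encoding of trees and the function name are assumptions made for this example only.

```python
# Toy PCFG in Chomsky normal form; keys are (left-hand side, right-hand side).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("PP", ("P", "NP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.4,
    ("P", ("with",)): 1.0,
    ("V", ("saw",)): 1.0,
    ("NP", ("astronomers",)): 0.1,
    ("NP", ("ears",)): 0.18,
    ("NP", ("saw",)): 0.04,
    ("NP", ("stars",)): 0.18,
    ("NP", ("telescopes",)): 0.1,
}

def tree_probability(tree, rule_prob):
    """Product of the probabilities of all rules used in the tree.

    A tree is (label, word) for a lexical node or (label, left, right) otherwise.
    """
    if len(tree) == 2:
        label, word = tree
        return rule_prob[(label, (word,))]
    label, left, right = tree
    return (rule_prob[(label, (left[0], right[0]))]
            * tree_probability(left, rule_prob)
            * tree_probability(right, rule_prob))

# Two analyses of "astronomers saw stars with ears".
np_attach = ("S", ("NP", "astronomers"),
             ("VP", ("V", "saw"),
              ("NP", ("NP", "stars"), ("PP", ("P", "with"), ("NP", "ears")))))
vp_attach = ("S", ("NP", "astronomers"),
             ("VP", ("VP", ("V", "saw"), ("NP", "stars")),
              ("PP", ("P", "with"), ("NP", "ears"))))

p_np = tree_probability(np_attach, RULE_PROB)   # approximately 0.0009072
p_vp = tree_probability(vp_attach, RULE_PROB)   # approximately 0.0006804
best = np_attach if p_np >= p_vp else vp_attach
print(p_np, p_vp)
```

Under these illustrative probabilities the analysis that attaches the prepositional phrase to the noun phrase is preferred.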
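The PCFG model also has a parameter estimation problem: given a grammar $G$ and a training sentence $w_{1m}$, how should the probabilities of the grammar rules be chosen so that the probability of the training sentence is maximized, that is, $\arg\max_G P(w_{1m} \mid G)$. For a grammar in Chomsky normal form, the parameters of a PCFG are:
The probabilities of grammar rules: $P(N^j \to N^r N^s \mid G)$
The probabilities of vocabulary rules: $P(N^j \to w^k \mid G)$
The constraint on the parameters: $\sum_{r,s} P(N^j \to N^r N^s) + \sum_{k} P(N^j \to w^k) = 1$ for every $j$
To estimate the parameter values, the outside variables $\alpha_j(p, q)$ and the inside variables $\beta_j(p, q)$ must first be defined.
The outside variable: $\alpha_j(p, q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} \mid G)$
The inside variable: $\beta_j(p, q) = P(w_{pq} \mid N^j_{pq}, G)$
The outside variable $\alpha_j(p, q)$ is the probability that the start symbol of the grammar, written $N^1$ here, derives the nonterminal $N^j$ spanning positions $p$ to $q$ together with all the words outside that span. The inside variable $\beta_j(p, q)$ is the probability that the word string $w_{pq} = w_p w_{p+1} \ldots w_q$ is derived from the nonterminal $N^j$. The values of both variables can be computed recursively by using the context-free assumptions of the PCFG.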
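The inside and outside variables can be computed by dynamic programming over spans. The sketch below is one way to do this for a grammar in Chomsky normal form; the toy grammar, the dictionary encoding of rules, the 1-based span indices and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Toy grammar: binary rules keyed by (parent, left, right), lexical rules by (parent, word).
BINARY = {("S", "NP", "VP"): 1.0, ("PP", "P", "NP"): 1.0, ("VP", "V", "NP"): 0.7,
          ("VP", "VP", "PP"): 0.3, ("NP", "NP", "PP"): 0.4}
LEXICAL = {("P", "with"): 1.0, ("V", "saw"): 1.0, ("NP", "astronomers"): 0.1,
           ("NP", "ears"): 0.18, ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1}

def inside(words, binary, lexical):
    """beta[(j, p, q)]: probability that nonterminal j derives words p..q (1-based)."""
    m = len(words)
    beta = defaultdict(float)
    for h in range(1, m + 1):
        for (j, w), prob in lexical.items():
            if w == words[h - 1]:
                beta[(j, h, h)] += prob
    for span in range(2, m + 1):                 # shorter spans are filled first
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (j, r, s), prob in binary.items():
                for d in range(p, q):            # split point between the children
                    beta[(j, p, q)] += prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
    return beta

def outside(words, binary, beta, start="S"):
    """alpha[(j, p, q)]: probability of everything outside span p..q with j above it."""
    m = len(words)
    alpha = defaultdict(float)
    alpha[(start, 1, m)] = 1.0
    for span in range(m, 1, -1):                 # longer (parent) spans are processed first
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (j, r, s), prob in binary.items():
                parent = alpha[(j, p, q)]
                for d in range(p, q):
                    alpha[(r, p, d)] += parent * prob * beta[(s, d + 1, q)]
                    alpha[(s, d + 1, q)] += parent * prob * beta[(r, p, d)]
    return alpha

words = "astronomers saw stars with ears".split()
beta = inside(words, BINARY, LEXICAL)
alpha = outside(words, BINARY, beta)
print(beta[("S", 1, len(words))])                # sentence probability, about 0.0015876
```

For every word position h, the sum over nonterminals of alpha[(j, h, h)] * beta[(j, h, h)] should equal the sentence probability, which is a convenient check on both tables.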
In the EM algorithm (here, the inside-outside algorithm), the unobserved (hidden) data is which rules $N^j \to \xi$ (including $N^j \to N^r N^s$ and $N^j \to w^k$) were used to generate a particular word sequence $w_{pq}$. The E step computes the expected number of times each rule is used; the M step computes the maximum likelihood estimate of the rule probabilities from these expected counts. This makes it possible to train a PCFG on a corpus that has no syntactic constituent annotation.
The inside-outside algorithm is introduced below in two steps.
First of all, the expected usage frequency of each rule must be computed in order to determine the rule probabilities. The quantity to compute is
$$\hat{P}(N^j \to \xi) = \frac{C(N^j \to \xi)}{\sum_{\xi'} C(N^j \to \xi')}$$
where $C(\cdot)$ counts how many times a specific rule is used. If a syntactically analyzed corpus is available, this probability can be computed directly. Usually, however, an analyzed corpus is difficult to obtain, so it is not known which rules were used to form a particular word sequence. Each rule is therefore given an initial probability estimate (which can be chosen randomly), and an iterative algorithm is then used to improve the estimate.
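When a parsed corpus is available, the counting estimate described above reduces to relative frequencies. A minimal sketch, assuming binary-branching trees encoded as nested tuples (a hypothetical encoding used only for this example):

```python
from collections import Counter, defaultdict

def count_rules(tree, counts):
    """Add one count for every rule used in the tree.

    A tree is (label, word) for a lexical node or (label, left, right) otherwise.
    """
    if len(tree) == 2:
        counts[(tree[0], (tree[1],))] += 1
        return
    label, left, right = tree
    counts[(label, (left[0], right[0]))] += 1
    count_rules(left, counts)
    count_rules(right, counts)

def estimate_rule_probabilities(trees):
    """Relative-frequency estimate: C(rule) / sum of C over rules with the same parent."""
    counts = Counter()
    for tree in trees:
        count_rules(tree, counts)
    totals = defaultdict(int)
    for (lhs, _rhs), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

# Hypothetical two-sentence treebank.
treebank = [
    ("S", ("NP", "astronomers"), ("VP", ("V", "saw"), ("NP", "stars"))),
    ("S", ("NP", "stars"), ("VP", ("V", "saw"), ("NP", "ears"))),
]
probs = estimate_rule_probabilities(treebank)
print(probs[("NP", ("stars",))])   # 2 of the 4 NP rule uses, i.e. 0.5
```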
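For a single training sentence $w_{1m}$, the inside and outside variables give
$$\alpha_j(p, q)\,\beta_j(p, q) = P(w_{1m}, N^j_{pq} \mid G)$$
and the probability of the sentence is $P(w_{1m} \mid G) = \beta_1(1, m)$, where $N^1$ denotes the start symbol. Therefore the maximum likelihood estimates of the probabilities of $N^j \to N^r N^s$ and $N^j \to w^k$ are
$$\hat{P}(N^j \to N^r N^s) = \frac{\sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{d=p}^{q-1} \alpha_j(p, q)\, P(N^j \to N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q)}{\sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p, q)\, \beta_j(p, q)}$$
$$\hat{P}(N^j \to w^k) = \frac{\sum_{h=1}^{m} \alpha_j(h, h)\, P(w_h = w^k)\, \beta_j(h, h)}{\sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p, q)\, \beta_j(p, q)}$$
where $P(w_h = w^k)$ is 1 if the word at position $h$ is $w^k$ and 0 otherwise.
A training corpus generally contains more than one sentence. Suppose the training set is $W = (W_1, W_2, \ldots, W_\omega)$, where $W_i = (w_{i,1} w_{i,2} \cdots w_{i,m_i})$. Let $f_i$, $g_i$ and $h_i$ denote the probabilities of branching nodes, preterminal nodes and any other nonterminal nodes in the parse tree of sentence $W_i$:
$$f_i(p, q, j, r, s) = \frac{\sum_{d=p}^{q-1} \alpha_j(p, q)\, P(N^j \to N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q)}{P(W_i \mid G)}$$
$$g_i(h, j, k) = \frac{\alpha_j(h, h)\, P(w_h = w^k)\, \beta_j(h, h)}{P(W_i \mid G)}$$
$$h_i(p, q, j) = \frac{\alpha_j(p, q)\, \beta_j(p, q)}{P(W_i \mid G)}$$
Assuming that the sentences in the training corpus are independent, the re-estimation formulas sum the contributions of all sentences:
$$\hat{P}(N^j \to N^r N^s) = \frac{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i - 1} \sum_{q=p+1}^{m_i} f_i(p, q, j, r, s)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p, q, j)}$$
$$\hat{P}(N^j \to w^k) = \frac{\sum_{i=1}^{\omega} \sum_{h=1}^{m_i} g_i(h, j, k)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p, q, j)}$$
The inside-outside algorithm repeats this parameter estimation until the estimated probability of the training corpus changes very little. If $G_i$ is the grammar (including the rule probabilities) at the $i$-th training iteration, the probability of the corpus under the model is guaranteed not to decrease, that is:
$$P(W \mid G_{i+1}) \ge P(W \mid G_i)$$
PCFG offers a new way of building robust syntactic analyzers, but two problems restrict its application. First, the learning algorithm converges very slowly: for each sentence, the time complexity of each training iteration is $O(m^3 n^3)$, where $m$ is the length of the sentence and $n$ is the number of nonterminal symbols in the grammar. Second, the convergence properties of the algorithm deteriorate sharply as the number of nonterminal symbols increases, and the local maximum problem is very serious. Many improvements to the algorithm have therefore been proposed. An experiment reported in the literature trained a PCFG parser on the ATIS corpus: with 700 training sentences and 75 iterations, the accuracy was 37.35% when the training corpus contained only tag information and 90.36% when the training corpus had been given a shallow syntactic analysis.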
In another experiment reported in the literature, the training corpus was WSJ text that had been given a shallow syntactic analysis; with 1095 training sentences and 80 iterations, the node accuracy was 90.22% and the sentence accuracy was 57.14%.
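The re-estimation loop of the inside-outside algorithm described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the cited experiments: the grammar is in Chomsky normal form, rules are stored in dictionaries, spans are 1-based, every training sentence is parseable, and the toy grammar, corpus and function names are hypothetical.

```python
import math
from collections import defaultdict

def inside_outside(words, binary, lexical, start):
    """Return the outside (alpha) and inside (beta) tables for one sentence."""
    m = len(words)
    alpha, beta = defaultdict(float), defaultdict(float)
    for h in range(1, m + 1):
        for (j, w), prob in lexical.items():
            if w == words[h - 1]:
                beta[(j, h, h)] += prob
    for span in range(2, m + 1):                       # inside pass, bottom-up
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (j, r, s), prob in binary.items():
                for d in range(p, q):
                    beta[(j, p, q)] += prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
    alpha[(start, 1, m)] = 1.0
    for span in range(m, 1, -1):                       # outside pass, top-down
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (j, r, s), prob in binary.items():
                parent = alpha[(j, p, q)]
                for d in range(p, q):
                    alpha[(r, p, d)] += parent * prob * beta[(s, d + 1, q)]
                    alpha[(s, d + 1, q)] += parent * prob * beta[(r, p, d)]
    return alpha, beta

def reestimate(corpus, binary, lexical, start="S"):
    """One EM iteration; also returns the corpus log-likelihood under the old grammar."""
    f_sum, g_sum, h_sum = defaultdict(float), defaultdict(float), defaultdict(float)
    nonterminals = {rule[0] for rule in binary} | {rule[0] for rule in lexical}
    log_like = 0.0
    for words in corpus:
        m = len(words)
        alpha, beta = inside_outside(words, binary, lexical, start)
        z = beta[(start, 1, m)]                        # sentence probability
        log_like += math.log(z)
        for p in range(1, m + 1):
            for q in range(p, m + 1):
                for j in nonterminals:                 # h_i(p, q, j)
                    h_sum[j] += alpha[(j, p, q)] * beta[(j, p, q)] / z
                for (j, r, s), prob in binary.items():  # f_i(p, q, j, r, s)
                    f_sum[(j, r, s)] += sum(
                        alpha[(j, p, q)] * prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
                        for d in range(p, q)) / z
        for h in range(1, m + 1):                      # g_i(h, j, k)
            for (j, w), prob in lexical.items():
                if w == words[h - 1]:
                    g_sum[(j, w)] += alpha[(j, h, h)] * prob / z

    def ratio(count, total):
        return count / total if total > 0.0 else 0.0   # unused nonterminal: keep at 0

    new_binary = {rule: ratio(f_sum[rule], h_sum[rule[0]]) for rule in binary}
    new_lexical = {rule: ratio(g_sum[rule], h_sum[rule[0]]) for rule in lexical}
    return new_binary, new_lexical, log_like

def train(corpus, binary, lexical, iterations=5, start="S"):
    history = []
    for _ in range(iterations):
        binary, lexical, log_like = reestimate(corpus, binary, lexical, start)
        history.append(log_like)
    return binary, lexical, history

BINARY = {("S", "NP", "VP"): 1.0, ("PP", "P", "NP"): 1.0, ("VP", "V", "NP"): 0.7,
          ("VP", "VP", "PP"): 0.3, ("NP", "NP", "PP"): 0.4}
LEXICAL = {("P", "with"): 1.0, ("V", "saw"): 1.0, ("NP", "astronomers"): 0.1,
           ("NP", "ears"): 0.18, ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1}
CORPUS = [s.split() for s in ("astronomers saw stars with ears",
                              "astronomers saw stars",
                              "stars saw astronomers with telescopes")]
new_binary, new_lexical, history = train(CORPUS, BINARY, LEXICAL)
print(history)
```

The entries of history are corpus log-likelihoods before each update and should be non-decreasing. The three nested loops over spans and split points, repeated for every rule, reflect the cubic cost per sentence mentioned above.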
Conclusion: The EM algorithm has a wide range of applications in statistical natural language processing. Rather than directly maximizing or simulating a complicated posterior distribution, it augments the observed data with "latent data" to simplify the computation and carries out a series of simple maximizations or simulations. The EM algorithm is simple and stable; in particular, each iteration guarantees that the log-likelihood of the observed data does not decrease, so that the likelihood function converges to a local maximum. But the algorithm has some pitfalls. First, it is very sensitive to the initial values: poor initial parameters easily lead it to converge to a poor local optimum. Second, its convergence is slow. Therefore, models are generally trained "offline", that is, a model is put to use only after training has produced a satisfactory result, which is unsuitable for real-time processing.