Linear Reranking Model for Chinese Pinyin-to-Character Conversion

Abstract: Pinyin-to-character conversion is an important task in Chinese natural language processing. Previous work mainly focused on n-gram language models and machine learning approaches, sometimes with additional hand-crafted or automatic rule-based post-processing. The word n-gram language model cannot solve two problems: out-of-vocabulary word recognition and long-distance grammatical constraints. In this study, we propose a linear reranking model to address these problems. Our model uses the minimum error rate training method to combine different sub models, including word and character n-gram LMs, a part-of-speech tagging model and a dependency model. The impact of the different sub models on the conversion is examined experimentally and analyzed. Results on the Lancaster Corpus of Mandarin Chinese show that our new model outperforms the word n-gram language model.


INTRODUCTION
Research on pinyin-to-character conversion benefits many Chinese language processing tasks, such as speech recognition and intelligent input methods. However, the task remains difficult because of homophones. Chinese has about 418 distinct syllables without tones (the number varies slightly between standards), 3500 common characters, which cover 99.48% of all characters, and 7000 general characters. Therefore, each syllable may correspond to multiple characters. For example, 60 of the 3500 common characters share the syllable "yi". The task of Chinese pinyin-to-character conversion is to disambiguate the character corresponding to each syllable using the other characters in context.
The common approach to Chinese pinyin-to-character conversion is the n-gram Language Model (LM), which selects the word/character candidate sequence with the greatest probability. The word/character chosen for the current syllables depends on its probability given the previous words/characters (described in detail in the next section). The word n-gram language model has two major problems. One is the misrecognition of Out-Of-Vocabulary (OOV) words: since OOV words do not appear in the segmented training data, they may receive low probabilities from the word n-gram LM. The other is that an n-gram language model only captures constraints among nearby words and lacks long-distance constraints from grammatical or syntactic structure.
Many approaches have been proposed to deal with these problems. Discriminative methods, such as the maximum entropy model (Xiao et al., 2007) and support vector machines (Jiang et al., 2007), can employ pinyin features after the current word/character, rather than only the information before it as in the word n-gram LM. Rule-based post-processing applies grammatical rule constraints to the candidates; these rules can be hand-crafted or obtained by statistical methods. Wang et al. (2004) employed rough set theory to extract rules automatically from the training dataset.
In this study, we propose a linear reranking model for the Chinese pinyin-to-character conversion problem. Our model can utilize information from word/character n-gram LMs, a character-based discriminative model, a pinyin-word co-occurrence model, a Part-Of-Speech (POS) tagging model, a word-POS co-occurrence model and a dependency model. We combine these sub models through their probability outputs for the candidates. The minimum error rate training method is used to obtain the weights of the sub models. We perform experiments on the Lancaster Corpus of Mandarin Chinese to analyze the effect of the different sub models on pinyin-to-character conversion; the results show that the information from the sub models helps the conversion.

PINYIN-TO-CHARACTER CONVERSION
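Given a syllable sequence $S = s_1, s_2, \ldots, s_n$, the aim of Chinese pinyin-to-character conversion is to find the corresponding sequence of characters C or sequence of words W. With the word n-gram LM, the chosen word sequence is $W^* = \arg\max_W P(W|S) = \arg\max_W P(S|W) P(W) / P(S)$. The character n-gram LM selects C in the same way, where P(S|C) and P(C) are calculated similarly to P(S|W) and P(W).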
When applying the word n-gram LM, a beam search decoding method is used for Chinese pinyin-to-character conversion. Normally, P(S|W) and P(S|C) are omitted in the decoding phase. For each position j in the syllable sentence, the decoding algorithm generates all vocabulary words that match a syllable sequence $s_i, \ldots, s_j$ ending at position j and calculates the probability $p(w_c|w_{c-2}, w_{c-1})$ of the current word $w_c$ given the previous words. For every position j, it keeps the n-best word sequences. The best word sequence is the one with the greatest probability at the end of the sentence.
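The decoding procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the toy lexicon and scores are invented, a bigram score stands in for the trigram $p(w_c|w_{c-2}, w_{c-1})$ to keep the example short, and the names `LEXICON`, `BIGRAM`, `bigram_logprob` and `decode` are hypothetical.

```python
import math

# Hypothetical toy lexicon: syllable tuple -> candidate words.
LEXICON = {
    ("yi",): ["一", "以"],
    ("zhi",): ["只", "之"],
    ("yi", "zhi"): ["一只", "一直"],
    ("mao",): ["猫", "毛"],
}

# Hypothetical bigram log-probabilities; unseen pairs get a flat penalty.
BIGRAM = {
    ("<s>", "一只"): math.log(0.4),
    ("<s>", "一直"): math.log(0.3),
    ("一只", "猫"): math.log(0.5),
    ("一直", "猫"): math.log(0.01),
}
UNSEEN = math.log(1e-4)


def bigram_logprob(prev, word):
    return BIGRAM.get((prev, word), UNSEEN)


def decode(syllables, beam_size=3, max_word_len=4):
    """Return up to beam_size (log-prob, word list) hypotheses covering all syllables."""
    # beams[j] holds the best hypotheses ending at syllable position j.
    beams = [[] for _ in range(len(syllables) + 1)]
    beams[0] = [(0.0, ["<s>"])]
    for j in range(1, len(syllables) + 1):
        candidates = []
        for i in range(max(0, j - max_word_len), j):
            span = tuple(syllables[i:j])
            for word in LEXICON.get(span, []):
                for score, words in beams[i]:
                    new_score = score + bigram_logprob(words[-1], word)
                    candidates.append((new_score, words + [word]))
        # Keep only the n-best word sequences for this position.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beams[j] = candidates[:beam_size]
    # Drop the sentence-start marker before returning.
    return [(score, words[1:]) for score, words in beams[-1]]
```

Calling `decode(["yi", "zhi", "mao"])` returns the hypotheses ranked by score; the list it returns is the kind of k-best list that a later reranking step would consume.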

Preliminary experiments:
We perform experiments on the Lancaster Corpus of Mandarin Chinese (LCMC) to analyze the performance of the word n-gram LM and then compare it with the single character n-gram LM, the oracle of its mixed result with the character n-gram LM and the oracle of k-best word sequences generated by the word n-gram LM.
The training data for our word and character n-gram LMs are taken from the People's Daily of 1998, 2006, 2007 and 2009-2012. Obtaining a full list of Chinese words is impossible because the specification of word segmentation is not settled and the number of words is still growing. Even if all words were given, using them would be impractical, because an accurate word n-gram LM needs a huge amount of training data and the trained language model would be too large to use. Our word dictionary for the word n-gram LM contains 130750 words selected from four sources: the 7000 general characters, the 56064 common words, words from the XinHua dictionary and the 94412 highest-frequency words extracted from Google Chinese Web 5-gram Version 1. For the character n-gram LM, the word dictionary contains only the 7000 general characters. We use a minimum word number method to segment the raw training text and the SRILM tool to train our language models (Stolcke, 2002). Both models are also evaluated by In-Vocabulary (IV) word error rate and OOV word error rate. The OOV words in Table 1 are defined as Chinese words not in the word dictionary. Both IV and OOV words are longer than one character. Table 2 shows that the word n-gram LM achieves better performance than the character n-gram LM on all evaluation measures, providing evidence that the word n-gram LM provides more constraints than the character n-gram LM. The word n-gram LM also achieves better performance on IV words than on OOV words.
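The "minimum word number" segmentation mentioned above can be read as choosing the segmentation that uses the fewest dictionary words. The following is a minimal sketch under that reading, using dynamic programming; the function name `min_word_segment`, the `max_word_len` limit and the rule that an unknown single character counts as one word are assumptions made for illustration, not details given in the text.

```python
def min_word_segment(text, dictionary, max_word_len=4):
    """Segment text into the fewest words; unknown single characters are allowed."""
    n = len(text)
    # best[j] = (number of words, segmentation) for the prefix text[:j]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for j in range(1, n + 1):
        for i in range(max(0, j - max_word_len), j):
            piece = text[i:j]
            # A piece is usable if it is a dictionary word or a single character.
            if piece in dictionary or j - i == 1:
                count = best[i][0] + 1
                if best[j] is None or count < best[j][0]:
                    best[j] = (count, best[i][1] + [piece])
    return best[n][1]
```

For example, with the toy dictionary `{"中国", "人民", "国人"}`, the text "中国人民" is split into two words rather than the three-word alternative that uses "国人".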
The oracle of the mixed result of the word and character n-gram LMs and the oracle of the k-best word lists are calculated for comparison with the word n-gram LM. The k-best word lists are generated using the beam search algorithm on the word-based model. Table 2 shows that the word and character n-gram LMs make different errors and that the oracle of their mixed result obtains better performance. The oracle of k-best word sequences achieves the best CER. The fact that the word n-gram LM performs much worse than both oracles leaves room to improve it by using other information.
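As an illustration of the k-best oracle, the sketch below picks, for every sentence, the candidate closest to the reference and reports the resulting CER. It assumes that CER is the fraction of characters that differ from the reference and that each candidate has the same length as its reference (one character per syllable); the function names `char_errors`, `one_best_cer` and `oracle_cer` are hypothetical.

```python
def char_errors(candidate, reference):
    """Count positions where the candidate character differs from the reference."""
    assert len(candidate) == len(reference)
    return sum(1 for c, r in zip(candidate, reference) if c != r)


def one_best_cer(kbest_lists, references):
    """CER of the top-ranked candidate of each k-best list."""
    errors = sum(char_errors(cands[0], ref) for cands, ref in zip(kbest_lists, references))
    return errors / sum(len(ref) for ref in references)


def oracle_cer(kbest_lists, references):
    """CER obtained by always picking the best candidate in each k-best list."""
    errors = sum(
        min(char_errors(c, ref) for c in cands)
        for cands, ref in zip(kbest_lists, references)
    )
    return errors / sum(len(ref) for ref in references)
```

The gap between `one_best_cer` and `oracle_cer` on the same lists is the headroom that a reranker could, at best, recover.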

LINEAR RERANKING MODEL
Experimental results show that the word n-gram LM alone is insufficient for Chinese pinyin-to-character conversion. We propose a linear reranking model to combine different sources of information. In this study, several sub models are introduced into our model, including a part-of-speech tagging model, a pinyin-word co-occurrence model, a word-POS co-occurrence model and a dependency model. To utilize these sub models, we combine their probability outputs for each candidate and use the minimum error rate training method to obtain the weights of these models. The k-best candidates for each pinyin sentence are first generated by the word n-gram LM using the beam search algorithm and then POS tagged and dependency parsed. Our linear reranking model is then used to select the best of these candidates.
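The selection step can be sketched as follows, assuming each candidate already carries one score per sub model. The function name `rerank`, the data layout and the example weights are hypothetical; whether the scores are probabilities or log-probabilities is not fixed by this sketch.

```python
def rerank(candidates, weights):
    """Return the candidate whose weighted sum of sub-model scores is highest.

    candidates: list of (word_sequence, [score of sub model 1, ..., sub model k])
    weights:    list of k weights, one per sub model
    """
    def combined(item):
        _, scores = item
        return sum(w * s for w, s in zip(weights, scores))

    return max(candidates, key=combined)[0]
```

With two sub models, changing the weights can change which candidate wins, which is why the weights have to be tuned on held-out data.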
Minimum Error Rate Training (MERT) was proposed to combine different features for machine translation (Och, 2003). It has been successfully used in joint word segmentation and POS tagging (Jiang et al., 2008). The probability of a possible corresponding word sequence W for a syllable sequence S is calculated as:

$$P(W|S) = \sum_{j=1}^{k} w_j P_j(W|S)$$

The weights $w_j$ ($1 \le j \le k$) can be obtained by keeping the other weights fixed and calculating each weight $w_j$ iteratively. The probability of the candidate is:

$$P(W|S) = w_j P_j(W|S) + \sum_{i \ne j} w_i P_i(W|S)$$

The left term $w_j P_j(W|S)$ is a variable; the right term $\sum_{i \ne j} w_i P_i(W|S)$ is a constant. A direct grid search for each weight is not suitable for the task, since re-calculating the probabilities of all candidates to find the best one is expensive. MERT employs a piece-wise linear search algorithm along each jth dimension. The optimum value of each weight $w_j$ must lie among the intersection values of the lines described by the formula above. The optimum is easy to find because, when the critical value changes, only a few candidates need to be re-calculated. The details of the MERT algorithm are described in Zaidan (2009).
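The search along one weight dimension can be illustrated with a simplified variant. Each candidate's score is a line in the weight being tuned, so the top-ranked candidate of a sentence can only change where two lines intersect. The sketch below collects those critical values and probes one weight per interval; it is deliberately simpler than the incremental sweep of Och (2003), and the names `line_search` and `total_errors` as well as the `(a, b, errors)` layout are hypothetical.

```python
def total_errors(sentences, w):
    """Sum the errors of the top-scoring candidate of every sentence at weight w."""
    total = 0
    for cands in sentences:
        best = max(cands, key=lambda c: c[0] + w * c[1])
        total += best[2]
    return total


def line_search(sentences):
    """Pick the weight w for one sub model that minimizes total errors.

    sentences: list of candidate lists; each candidate is (a, b, errors), where
    a is the fixed contribution of the other sub models, b is this sub model's
    score, and the candidate's combined score is a + w * b.
    Returns (best weight, total errors at that weight).
    """
    # Critical values: weights at which two candidates of a sentence tie.
    critical = set()
    for cands in sentences:
        for x in range(len(cands)):
            for y in range(x + 1, len(cands)):
                a1, b1, _ = cands[x]
                a2, b2, _ = cands[y]
                if b1 != b2:
                    critical.add((a2 - a1) / (b1 - b2))
    points = sorted(critical)
    if not points:
        return 0.0, total_errors(sentences, 0.0)
    # The top candidate is constant inside each interval, so one probe per interval suffices.
    probes = [points[0] - 1.0]
    probes += [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]
    probes.append(points[-1] + 1.0)
    return min(((w, total_errors(sentences, w)) for w in probes), key=lambda p: p[1])
```

Repeating this search for each weight in turn, with the others held fixed, gives the coordinate-wise procedure described in the text.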
Next, we will describe the sub models we employed and the method of calculating the probability output of the candidate for each sub model.

Character-based discriminative model:
The character-based model is a discriminative model that determines each character in a sentence using only information from the pinyin sequence. The model is trained using the averaged perceptron algorithm. The pinyin feature templates are listed in Table 3. P(S|C) is the probability output. The features Φ(S, C) can be generated before decoding, so decoding time is greatly reduced. Details of the decoding algorithm and the parameter estimation method can be found in Collins (2002) and Li et al. (2011).
Pinyin-word co-occurrence: Syllables s may correspond to multiple possible words w. For a syllable sequence S and its corresponding word sequence W, we define the pinyin-word co-occurrence probabilities $P_{occur}(W|S)$ and $P_{occur}(S|W)$ from the per-word probabilities $p(w_i|s_i)$ and $p(s_i|w_i)$. We count the number of syllables $N(s_i)$ and of syllable-word pairs $N(w_i, s_i)$ in the annotated dataset. The training dataset is the same as the dataset for the word n-gram LM.
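A minimal sketch of the count-based estimate follows. It assumes the maximum likelihood estimate $p(w|s) = N(w, s) / N(s)$ and, as a further assumption not stated in the text, that the sequence-level value is the product of the per-word probabilities. The function names `train_cooccurrence` and `sequence_cooccurrence` are hypothetical.

```python
from collections import Counter


def train_cooccurrence(pairs):
    """Estimate p(w|s) = N(w, s) / N(s) from (syllables, word) pairs by MLE."""
    pair_counts = Counter(pairs)
    syllable_counts = Counter(s for s, _ in pairs)
    return {(s, w): n / syllable_counts[s] for (s, w), n in pair_counts.items()}


def sequence_cooccurrence(prob, syllable_word_pairs):
    """Assumed sequence-level score: product of the per-word probabilities."""
    result = 1.0
    for s, w in syllable_word_pairs:
        # Pairs never seen in training get probability 0 in this unsmoothed sketch.
        result *= prob.get((s, w), 0.0)
    return result
```

Swapping the roles of syllables and words in the counts gives the reverse direction $p(s|w)$ in the same way.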
POS tagging model: Given the word sequence W, we can determine its Part-Of-Speech (POS) tags. Previous work reveals that POS information improves word segmentation (Ng and Low, 2004; Zhang and Nivre, 2011). In this study, we introduce a POS tagging model to help select the best word sequence. The averaged perceptron algorithm is used for parameter estimation.
The POS tag sequence T for word sequence W is chosen as the highest-scoring tag sequence under this model. The features Φ(T, W) for the POS tagging model are defined similarly to Zhang and Clark (2008) and are shown in Table 4.
The training data for POS tagging are taken from the Chinese Treebank (CTB5), and the split into training, development and test datasets is the same as in Zhang and Clark (2008). The F-1 measure of our POS model on the gold-segmented development dataset is 95.26%.

POS N-gram language model:
The POS n-gram LM is trained on the data described before, after POS tagging. It defines the probability P(T) of the POS sequence T.
POS-word co-occurrence: The POS-word co-occurrence is defined similarly to the syllable-word co-occurrence. For a word sequence W and its corresponding POS sequence T, the co-occurrence is defined from $p(w_i|t_i)$ and $p(t_i|w_i)$, which are also calculated using the MLE method.
Dependency model: For a long sentence W, the word for a group of syllables cannot always be determined using only information from nearby words and syllables; grammatical and syntactic information may also be needed. For example, in the syllable sequence "yi zhi mei li ke ai de xiao hua mao", the choice of word for the syllables "yi zhi" depends on its relation with the syllables "xiao hua mao". The dependency model brings long-distance word relations to Chinese pinyin-to-character conversion. For a segmented and POS-tagged sentence (W, T), we use a deterministic transition-based algorithm for dependency parsing (Zhang and Nivre, 2011). Figure 2 gives an example of a dependency tree structure. The dependency model outputs the probability P(D|W, T) of the dependency tree D.

EXPERIMENTS AND ANALYSIS
The experimental dataset for the linear reranking model is the same as in the preliminary experiments on the word n-gram LM in section 2. As in calculating the oracle of k-best word sequences, we take 500 candidate dependency trees for each syllable sequence. The linear reranking model is then used to select the best of these candidates. Character Error Rate (CER) is used as our evaluation measure.
Experiments on sub models: All our sub models can generate a probability for a given dependency tree D, as listed in the section on the linear reranking model. We train these sub models on the training dataset and evaluate the linear reranking model on the development dataset. We first employ a backward greedy search algorithm to find a suitable set of sub models. The procedure starts with the linear model containing all sub models and evaluates it on the development data. In each iteration, it removes each sub model in turn and re-evaluates the performance; the sub model whose removal gives the greatest improvement on the development data is then removed from the linear model. The procedure continues until the performance no longer improves.
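The backward greedy search can be sketched as below. The function name `backward_greedy` and the `evaluate` callback, which is assumed to retrain the weights and return the development CER for a given set of sub models, are hypothetical.

```python
def backward_greedy(sub_models, evaluate):
    """Drop sub models one at a time while the development CER keeps improving.

    evaluate(models) must return the CER (lower is better) of the linear model
    built from the given set of sub models.
    """
    current = set(sub_models)
    best_cer = evaluate(current)
    while len(current) > 1:
        # Try removing each remaining sub model and keep the most helpful removal.
        trials = [(evaluate(current - {m}), m) for m in sorted(current)]
        cer, removed = min(trials)
        if cer >= best_cer:
            break  # no removal improves the development CER
        current.remove(removed)
        best_cer = cer
    return current, best_cer
```

Each iteration costs one evaluation per remaining sub model, so the whole search needs at most a quadratic number of evaluations in the number of sub models.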
The results are shown in Table 5. After the sub model selection strategy, the word-pinyin occurrence P occur (T|W) is eliminated because it does not improve the performance of the linear model. We compare the linear reranking model using all remaining sub models with the models obtained by removing one sub model at a time.
Table 5 shows that all the sub models have a positive impact on the task: the word n-gram LM improves the performance most, the character n-gram LM ranks second and the pinyin-word occurrence ranks third. The dependency model also brings an improvement on the development data, larger than that of the POS-word occurrence model.
We then evaluate the impact of each sub model on the task. The experimental results are shown in Fig. 3. The numbers 0, ..., 8 on the X axis denote the sub models listed in Table 5. The upper curve in Fig. 3 shows the results when only one sub model is used. The results reveal a pattern similar to Table 5: the word n-gram LM achieves the lowest CER and the character n-gram LM ranks second. The dependency model gives the lowest accuracy.
The lower curve shows the results of linear reranking models that combine the word n-gram LM with one other sub model. These linear reranking models outperform all single sub models. Given the word n-gram LM, adding the pinyin-word occurrence model, which is omitted in the original decoding algorithm described in the section on pinyin-to-character conversion, achieves the best performance.

Experiments on sub model group:
We then evaluate different groups of sub models to determine which information has the most influence on the conversion. All sub models are split into four groups: the character group, the pinyin group, the POS group and the dependency group. The results in Table 6 show that, apart from the word n-gram model, the pinyin group provides the most useful information and the dependency model gives the least benefit.
Figure 4 shows the experimental results of linear reranking models that start from the word n-gram LM and add one sub model group at a time. The entire linear reranking model benefits from every sub model group. The final model achieves a 1.51-point decrease in CER compared with the word n-gram LM, about a 14.49% improvement in performance.

Complexity analysis:
The space and time complexity of the linear reranking model is vital for Chinese pinyin-to-character conversion. All sub model scores can be computed incrementally as characters are appended to the sentence, with time complexity O(n) linear in the sentence length n. The linear reranking model is therefore also linear in sentence length, since it is a linear combination of these sub models.

Experiments on test data:
We then evaluate our models on the test data. The experimental results are similar to those on the development data. As more sub models are added, the performance of the linear reranking model increases. The final model achieves a CER of 10.94, about an 8.89% decrease from the 12.03 CER of the word n-gram LM.

Fig. 1: An example for pinyin-to-character conversion

The conversion seeks a sequence of characters $C = c_1, c_2, \ldots, c_n$, or a sequence of words $W = w_1, w_2, \ldots, w_n$, where each word $w_k$ is composed of one or more characters $c_i, \ldots, c_j$. An example of Chinese pinyin-to-character conversion is shown in Fig. 1. Using the word n-gram LM, the word sequence W for the input syllable sequence S satisfies:

$$W^* = \arg\max_W P(W|S) = \arg\max_W P(S|W) P(W) / P(S)$$

$P_{sub}(W, C, T, D|S)$ represents the probability output of the sub models, including the word n-gram LM P(W), the character n-gram LM P(C), the character-based discriminative model P(S|C), the pinyin-word co-occurrence models $P_{occur}(W|S)$ and $P_{occur}(S|W)$, the POS tagging model P(T|W), the POS n-gram LM P(T), the POS-word occurrence model $P_{occur}(W|T)$ and the dependency model P(D|W, T). The first line in the equation is valid because a word sequence W corresponds to only one POS sequence T and dependency tree D.
Fig. 2: An example of a dependency tree

Fig. 3: Performance of sub models on development data

Table 1: Statistics of training, development and test data

Table 2: Performance of different LMs on development data

LCMC contains 15 different text categories, with a total of 45735 segmented, POS-tagged and pinyin-annotated sentences. The original pinyin text was annotated using the pinyin4j tool, which only considers the most probable syllable for each character separately. We re-annotated the pinyin of the sentences using a minimum word number matching algorithm on the Sogou dictionary. The re-annotated pinyin sequence is more accurate than the original one. To evaluate the word n-gram LM, we selected 200 sentences from each category for the development dataset and another 200 for the test dataset, and used the remaining 39735 sentences as training data. All the training, development and test sentences are first split at Chinese end punctuation and then filtered by eliminating those containing English words. The statistics of our training, development and test data are listed in Table 1.
The performance of the word and character n-gram LMs is evaluated on the development dataset. The results are shown in Table 2. Both models are evaluated by the Character Error Rate (CER) measure:

$$\mathrm{CER} = \frac{\text{the number of incorrect characters}}{\text{the number of all characters}}$$

Table 5: Experimental results of linear reranking model on