Linear Reranking Model for Chinese Pinyin-to-Character Conversion

Xinxin Li; Xuan Wang; Lin Yao; Muhammad Waqas Anwar

doi:10.19026/rjaset.7.344

Research Journal of Applied Sciences, Engineering and Technology

Research Article | OPEN ACCESS

¹Xinxin Li, ¹Xuan Wang, ¹Lin Yao and ¹,²Muhammad Waqas Anwar

¹Computer Application Research Center, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
²Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan

Research Journal of Applied Sciences, Engineering and Technology 2014 5:975-980

http://dx.doi.org/10.19026/rjaset.7.344 | © The Author(s) 2014

Received: January 31, 2013 | Accepted: February 25, 2013 | Published: February 05, 2014

Abstract

Pinyin-to-character conversion is an important task in Chinese natural language processing. Previous work has mainly focused on n-gram language models and machine learning approaches, sometimes with additional hand-crafted or automatically derived rule-based post-processing. Word n-gram language models cannot solve two problems: out-of-vocabulary word recognition and long-distance grammatical constraints. In this study, we propose a linear reranking model that tries to address these problems. Our model uses a minimum error learning method to combine different sub-models, which include word and character n-gram LMs, a part-of-speech tagging model and a dependency model. The impact of the different sub-models on the conversion is evaluated experimentally and analyzed. Results on the Lancaster Corpus of Mandarin Chinese show that our new model outperforms the word n-gram language model.
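As a rough illustration of what linear reranking means here, the sketch below scores each candidate conversion by a weighted sum of sub-model scores and keeps the best one. It is not the authors' implementation: the function name `rerank`, the sub-model keys (`word_lm`, `char_lm`, `pos`, `dep`), the weights and all score values are invented for the example. In the abstract's setting the weights would be tuned by a minimum error learning method rather than set by hand.

```python
from typing import Dict, List

Scores = Dict[str, float]


def rerank(candidates: List[dict], weights: Scores) -> dict:
    """Return the candidate with the highest weighted sum of sub-model scores."""
    def total(candidate: dict) -> float:
        return sum(w * candidate["scores"][name] for name, w in weights.items())
    return max(candidates, key=total)


# Three real homophones for the toneless pinyin "shishi".
# All scores are made-up log-probability-like values (assumption).
candidates = [
    {"text": "事实", "scores": {"word_lm": -4.0, "char_lm": -6.0, "pos": -3.0, "dep": -5.0}},
    {"text": "实施", "scores": {"word_lm": -4.5, "char_lm": -5.0, "pos": -2.0, "dep": -3.0}},
    {"text": "时事", "scores": {"word_lm": -6.0, "char_lm": -7.0, "pos": -4.0, "dep": -6.0}},
]

# Hand-set weights (assumption); a tuning step would normally learn these.
word_only = {"word_lm": 1.0}
combined = {"word_lm": 1.0, "char_lm": 0.5, "pos": 0.5, "dep": 0.5}

baseline = rerank(candidates, word_only)   # ranks by the word n-gram score alone
reranked = rerank(candidates, combined)    # lets the other sub-models change the ranking
print(baseline["text"], reranked["text"])
```

With these invented numbers the word n-gram score alone prefers one candidate, while the combined score prefers another, which is the kind of correction a reranker is meant to make.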

Keywords:

Dependency model, minimum error learning method, part-of-speech tagging, word n-gram model

Competing interests

The authors have no competing interests.

Open Access Policy

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

ISSN (Online): 2040-7467
ISSN (Print): 2040-7459
