     Research Journal of Applied Sciences, Engineering and Technology


Domain biased Bilingual Parallel Data Extraction and its Sentence Level Alignment for English-Hindi Pair

Deepa Gupta, Vani Raveendran and Rahul Kumar Yadav
Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bangalore, India
Research Journal of Applied Sciences, Engineering and Technology  2014  6:1187-1198
http://dx.doi.org/10.19026/rjaset.7.379  |  © The Author(s) 2014
Received: March 19, 2013  |  Accepted: May 10, 2013  |  Published: February 15, 2014

Abstract

Creating parallel corpora and aligning them efficiently at the sentence level remains a challenge for structurally distinct languages with a relatively low degree of correlation. This work emphasizes the importance of domain-biased parallel data collection and presents a structured methodology for obtaining such data for the English-Hindi language pair. Because the two languages are structurally distinct, sentence-level alignment of the collected data is also undertaken. In essence, the study has two aspects: collecting parallel corpora from different domains and aligning the extracted parallel corpus at the sentence level. The work is intended to help researchers in Natural Language Processing improve the accuracy, precision and robustness of their systems, which is possible only when abundant parallel corpora are available, and more so when those corpora are organized by domain and aligned at least at the sentence level. The algorithm was developed for the English-Hindi pair; because it is generic in nature, the approach is scalable to other similarly structured language pairs.

Keywords:

Cost calculation, Natural Language Processing (NLP), non-official data, normal distribution, official data, parallel corpus collection, semi-official data, sentential alignment


Competing interests

The authors have no competing interests.

Open Access Policy

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

ISSN (Online):  2040-7467
ISSN (Print):   2040-7459
Copyright © 2024. MAXWELL Scientific Publication Corp., All rights reserved