
Chinese And Mongolian Lexical Analysis Research And Its Application In Statistical Machine Translation

Posted on: 2011-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Ying
Full Text: PDF
GTID: 2178360308955511
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Lexical analysis is a fundamental task in Natural Language Processing (NLP), and its accuracy directly affects downstream applications such as machine translation. This dissertation constructs a two-level statistical model for Chinese lexical analysis and Mongolian morphological analysis. The resulting lexical and morphological information is then added to a Chinese-Mongolian statistical machine translation system, which is evaluated on an aligned bilingual corpus. The results show that adding this information improves translation quality, which indicates that lexical analysis is important for statistical machine translation.

The dissertation systematically introduces Conditional Random Fields (CRFs): their definition, graphical structure, potential functions, feature functions, and training and decoding algorithms. We simplify the graphical structure of CRFs, design feature functions, improve the decoding algorithm, and apply CRFs to Chinese lexical analysis and Mongolian morphological analysis.

The dissertation presents a Chinese word segmentation model based on the Local Ambiguity Word Grid and CRFs. At the lower level, the model uses the Local Ambiguity Word Grid algorithm to generate a rough segmentation. It then segments the text again with CRFs, using the rough segmentation as one feature. The system was tested on the MSRA and PKU test sets provided by the SIGHAN 2005 Chinese Language Processing Bakeoff, where its F-measures in the closed test reach 97.1% and 95.1%, respectively. The dissertation also constructs a statistical model for Chinese POS tagging that can use more context information.

Because Mongolian expresses morphological changes by attaching different suffixes to stems, the dissertation uses a minimum description length algorithm to segment Mongolian surface forms. It then tags Mongolian parts of speech with CRFs, using the surface-form segmentation as one feature. The system is tested on the Mongolian test sets provided by Inner Mongolia University.

By adding the lexical and morphological information obtained from lexical analysis to a Factored Translation Model, the dissertation constructs several translation paths from source factors to target factors and uses several factor-based language models to evaluate translation quality. Finally, it builds a generation model from several language factors to Mongolian surface forms. By incorporating more lexical knowledge of the source and target languages, the translation system significantly improves translation quality.
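
The two-level segmentation idea and the F-measure quoted above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not give the feature templates or the Local Ambiguity Word Grid algorithm, so the sketch assumes a common B/M/E/S character-tagging scheme, a one-character context window, and a rough segmentation supplied from outside. All function names (to_bmes, char_features, word_spans, f_measure) are hypothetical, and the CRF training step itself is omitted.

```python
from typing import Dict, List, Set, Tuple


def to_bmes(words: List[str]) -> List[str]:
    """Map a segmented sentence to per-character B/M/E/S tags."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def char_features(chars: List[str], rough_tags: List[str], i: int) -> Dict[str, str]:
    """Features for character i: a small context window (assumed template)
    plus the tag implied by the lower-level rough segmentation."""
    prev_c = chars[i - 1] if i > 0 else "<BOS>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "<EOS>"
    return {
        "c0": chars[i],
        "c-1": prev_c,
        "c+1": next_c,
        "c-1c0": prev_c + chars[i],
        "c0c+1": chars[i] + next_c,
        "rough": rough_tags[i],  # rough segmentation used as one feature
    }


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Character-offset spans (start, end) of each word in a segmentation."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_measure(gold: List[str], predicted: List[str]) -> float:
    """Word-level F-measure: harmonic mean of precision and recall."""
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    rough = ["研究生", "命", "起源"]   # hypothetical rough (lower-level) output
    gold = ["研究", "生命", "起源"]    # reference segmentation
    chars = list("".join(gold))
    rough_tags = to_bmes(rough)
    features = [char_features(chars, rough_tags, i) for i in range(len(chars))]
    print(rough_tags)
    print(features[2])
    print(f_measure(gold, rough))
```

In a full system, the per-character feature dictionaries would be passed to a CRF trainer together with the gold B/M/E/S tags, so the second level can learn when to trust or override the rough segmentation.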
Keywords/Search Tags: Lexical Analysis, Conditional Random Fields, Local Ambiguity Word Grid, Minimum Description Length, Statistical Machine Translation