Chinese And Mongolian Lexical Analysis Research And Its Application In Statistical Machine Translation

Posted on:2011-08-24

Degree:Master

Type:Thesis

Country:China

Candidate:Y L Ying

Full Text:PDF

GTID:2178360308955511

Subject:Pattern Recognition and Intelligent Systems

Abstract/Summary:

Lexical analysis is a fundamental task in Natural Language Processing (NLP), and its accuracy directly affects downstream applications such as machine translation. This dissertation constructs a two-level statistical model for Chinese lexical analysis and Mongolian morphological analysis. The resulting lexical and morphological information is then added to a Chinese-Mongolian statistical machine translation system, which is evaluated on an aligned bilingual corpus. The results show that adding lexical and morphological information improves translation quality, and that lexical analysis is very important for statistical machine translation.

This dissertation systematically introduces Conditional Random Fields (CRFs): their definition, the graphical structure of the CRF model, the potential function, the feature functions, and the training and decoding algorithms. We simplify the graphical structure of CRFs, design feature functions, improve the decoding algorithm, and apply CRFs to Chinese lexical analysis and Mongolian morphological analysis.

This dissertation presents a model of Chinese word segmentation based on the Local Ambiguity Word Grid and Conditional Random Fields. At the lower level, the model uses the Local Ambiguity Word Grid algorithm to generate a rough segmentation. The text is then segmented again with CRFs, using the rough segmentation as one feature. The system was tested on the MSRA and PKU test sets provided by the SIGHAN2005 Chinese Language Processing Bakeoff; its F-measures in the closed test reach 97.1% and 95.1%, respectively. This dissertation also constructs a statistical model for Chinese POS tagging that can use more context information.

Because Mongolian expresses morphological change by attaching different suffixes to stems, this dissertation uses a minimum description length algorithm to segment Mongolian surface forms. Mongolian part-of-speech tagging is then performed with CRFs, using the surface-form segmentation as one feature. The system is tested on the Mongolian test sets provided by Inner Mongolia University.

By adding the lexical and morphological information obtained from lexical analysis to the Factored Translation Model, this dissertation constructs several translation paths from source factors to target factors and uses several factor-based language models to evaluate translation quality. Finally, the dissertation builds a generation model from several language factors to Mongolian surface forms. By incorporating more lexical knowledge of the source and target languages, the translation system significantly improves translation quality.
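
The two-level segmentation design described in the abstract, where a rough lower-level segmentation is passed to the CRF as one feature, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the dissertation: the B/M/E/S character tag set, the feature names, and the helper functions `to_bmes`, `char_features` and `sentence_features`. The rough segmentation is simply given as input; the Local Ambiguity Word Grid algorithm itself is not implemented.

```python
def to_bmes(words):
    """Convert a list of words into per-character B/M/E/S tags (assumed tag set)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle characters, End
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def char_features(chars, rough_tags, i):
    """Build a feature dict for character i (hypothetical feature names).

    The 'rough_tag' entry carries the lower-level segmentation result
    into the upper-level sequence model as one feature.
    """
    return {
        "char": chars[i],
        "prev_char": chars[i - 1] if i > 0 else "<BOS>",
        "next_char": chars[i + 1] if i < len(chars) - 1 else "<EOS>",
        "rough_tag": rough_tags[i],
    }


def sentence_features(rough_words):
    """Turn a rough segmentation into one feature dict per character."""
    chars = [c for word in rough_words for c in word]
    rough_tags = to_bmes(rough_words)
    return [char_features(chars, rough_tags, i) for i in range(len(chars))]


# Rough segmentation from a lower-level segmenter (given here by hand).
rough = ["研究", "生命", "起源"]
features = sentence_features(rough)
```

Per-character feature dicts of this shape are the kind of input a CRF toolkit typically accepts for sequence labelling; the upper-level model is then free to confirm or override the rough tag using the surrounding characters.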

Keywords/Search Tags:

Lexical Analysis, Conditional Random Fields, Local Ambiguity Word Grid, Minimum Description Length, Statistical Machine Translation

Related items

1	The Research And Applications Of Chinese Lexical Analysis
2	Research On Automatic Katakana Translation Technology
3	Research On Key Technologies In Thai Lexical Analysis
4	Research Of Chinese Phrase Identification Based On Conditional Random Fields
5	Research Of Chinese Word Segmentation With Conditional Random Fields
6	The Research Of Applying Conditional Random Fields To Chinese Lexical Analysis And Chunk Parsing
7	Statistical Shape Modeling Based On Minimum Description Length Optimization And Segmenting In Medical Images
8	Research And Implement Of Chinese Word Segment Techniques Based On The Conditional Random Field
9	Study On Several Key Problems In The Training Process Of Phrase-based Statistical Machine Translation
10	Research Of Phrase-based Translation Model Using Syntactic And Morphologic Information