Research On Mongolian Lexical Analysis Based On Combination Of Statistical And Rule Approaches

Posted on: 2012-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: L L Zhao
Full Text: PDF
GTID: 2218330368490961
Subject: Computer application technology
Abstract/Summary:
The word is the smallest linguistic unit that can be used independently, and it is the basic unit of natural language processing. Lexical analysis is a fundamental topic in natural language processing and mainly covers word segmentation and tagging. In linguistics, languages are classified by morphological structure: Chinese is an analytic language, while Mongolian is an agglutinative language. The morphological structure of Mongolian words therefore differs from that of Chinese words. Specifically, a Chinese word has no dedicated additional components that express grammatical meaning and undergoes very few morphological changes, while a Mongolian word has dedicated additional components that express grammatical meaning, namely affixes. Usually each affix expresses one grammatical meaning. A Mongolian word is composed of a stem and affixes.

Mongolian lexical analysis is currently an active research topic in the field of lexical analysis. In this thesis, a combination of statistical and rule-based approaches is adopted to study Mongolian lexical analysis in light of the characteristics of Mongolian morphological structure. The key points of the thesis are as follows:

1) A generative probabilistic language model is designed according to the morphological structure of Mongolian words and the characteristics of Mongolian stems and affixes. First, Mongolian sub-lexical analysis is described as a directed graph. The graph nodes represent the stems, the affixes and their corresponding tags, and the edges represent the transition and generation relationships among the nodes. A training program converts the training data into the language model. Second, the decoder finds the optimal lexical analysis result by using this language model and a dynamic programming algorithm. Experiments show that the generative statistical language model can significantly improve the accuracy of word-level joint Mongolian segmentation and tagging, reaching 93.5%.

2) Mongolian lexical analysis is studied by combining statistical and rule-based approaches: a "Mongolian syntax information dictionary" is added as a source of linguistic rules on top of the generative statistical model described above. The generative statistical model with linguistic rules brings the following improvements. On the one hand, during dynamic programming decoding of a sentence, the linguistic rules use the stem as a trigger condition to assign a higher probability to the correct candidate results. On the other hand, follow-up corrections are applied according to part of speech, for example for proper nouns. Experiments show that integrating statistical and rule-based methods is more effective than the purely statistical method, with accuracy reaching 95.2% on the test corpus.

3) The stem is the center of a Mongolian word. Because of its importance in Mongolian lexical analysis, stemming is studied separately. An automaton model is designed and implemented that represents a Mongolian word as a stem-centered master-slave structure.
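As a rough illustration of the decoding idea in points 1) and 2), the following Python sketch runs a Viterbi-style dynamic program over a small stem/affix lattice for a single word. It is a minimal sketch under stated assumptions, not the model from the thesis: the morphemes, tags, probabilities, the first-order tag-transition structure, and the `stem_bonus` mechanism are all invented for illustration.

```python
import math

# Toy model. Every morpheme, tag, and probability below is an invented
# placeholder for illustration, not data or parameters from the thesis.
EMIT = {  # P(morpheme | tag)
    "STEM": {"nom": 0.7, "no": 0.3},
    "PL": {"uud": 0.8, "muud": 0.2},
    "CASE": {"aas": 0.9, "d": 0.1},
}
TRANS = {  # P(tag | previous tag); "<s>" starts a word, "</s>" ends it
    "<s>": {"STEM": 1.0},
    "STEM": {"PL": 0.4, "CASE": 0.3, "</s>": 0.3},
    "PL": {"CASE": 0.6, "</s>": 0.4},
    "CASE": {"</s>": 1.0},
}


def decode(word, stem_bonus=None):
    """Return (log_score, [(morpheme, tag), ...]) for the best path, or None.

    stem_bonus is a hypothetical stand-in for dictionary rules: it maps a
    stem to a log-space bonus added whenever that stem is proposed.
    """
    stem_bonus = stem_bonus or {}
    # best[(pos, tag)] = (score, back-pointer to previous state, morpheme)
    best = {(0, "<s>"): (0.0, None, None)}
    for end in range(1, len(word) + 1):
        for start in range(end):
            morph = word[start:end]
            for tag, table in EMIT.items():
                if morph not in table:
                    continue
                # Extend every finished state that ends where this morpheme starts.
                for (pos, prev), (score, _, _) in list(best.items()):
                    p_trans = TRANS.get(prev, {}).get(tag)
                    if pos != start or p_trans is None:
                        continue
                    cand = score + math.log(p_trans) + math.log(table[morph])
                    if tag == "STEM":
                        cand += stem_bonus.get(morph, 0.0)
                    if (end, tag) not in best or cand > best[(end, tag)][0]:
                        best[(end, tag)] = (cand, (start, prev), morph)

    # Close the word with the end-of-word transition and pick the best state.
    final = None
    for (pos, tag), (score, _, _) in best.items():
        p_end = TRANS.get(tag, {}).get("</s>")
        if pos == len(word) and p_end is not None:
            total = score + math.log(p_end)
            if final is None or total > final[0]:
                final = (total, (pos, tag))
    if final is None:
        return None

    # Follow back-pointers to recover the segmentation and tags.
    path, key = [], final[1]
    while best[key][1] is not None:
        path.append((best[key][2], key[1]))
        key = best[key][1]
    return final[0], path[::-1]


if __name__ == "__main__":
    print(decode("nomuudaas"))
    print(decode("nomuudaas", stem_bonus={"no": math.log(100)}))
```

With these toy numbers, the first call prefers the analysis with the stem `nom`, and the second call shows how a stem-triggered bonus can shift the decoder toward a different candidate. How the thesis actually weights its dictionary rules is not stated in the abstract, so the additive log-space bonus here is only one plausible choice.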
Keywords/Search Tags:natural language processing, lexical analysis, Mongolian word segmentation, Mongolian part of speech tagging, stemming for Mongolian