Statistical Based Mongolian Part-of-Speech Tagging Study And Realization

Posted on: 2011-09-14
Degree: Master
Type: Thesis
Country: China
Candidate: H Yan
Full Text: PDF
GTID: 2178360305992514
Subject: Computer application technology
Abstract/Summary:
With the rapid development and spread of computer technology, and of network technology in particular, there is a growing need to exchange information between natural language and computers. Natural language information processing has therefore received unprecedented attention from researchers at home and abroad. Part-of-speech (POS) tagging is the basis of natural language information processing, and tagging accuracy directly affects follow-up work. A great deal of research has been done on automatic part-of-speech tagging for Chinese, with significant results, but research on automatic part-of-speech tagging for Mongolian is still lacking.

This thesis studies automatic Mongolian part-of-speech tagging and implements a statistical tagging system. The system trains a hidden Markov model on the training corpus to obtain two key model parameters: the part-of-speech transition probability matrix and the word probability distribution matrix. The Viterbi algorithm then uses these parameters to tag text automatically. The data sparseness problem of the hidden Markov model is addressed with word segmentation and linear interpolation, which to some extent avoids the loss of tagging accuracy that sparse data would otherwise cause.

Finally, the system is used to tag Mongolian text both before and after Mongolian segmentation, in the following tests. First, closed and open tests are run on corpora of different sizes. Then, closed and open tests are run with part-of-speech tag sets 2 and 3. The evaluation criteria are POS tagging accuracy and disambiguation accuracy for multi-category words. With a training corpus of 950,000 words, a test set of 50,000 words is tagged.
Experimental results show that POS tagging accuracy and disambiguation accuracy are about 97.9% and 85.9% respectively in the closed test, and about 97.6% and 85.5% respectively in the open test.
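
The approach described in the abstract (a hidden Markov model trained on a tagged corpus, smoothed by linear interpolation, and decoded with the Viterbi algorithm) can be illustrated with a minimal sketch. This is not the thesis implementation. The toy English corpus, the function names, and the weights `lam` and `mu` are assumptions made for illustration, and so is the exact form of the interpolation (tag bigram mixed with tag unigram, and word emission mixed with a uniform distribution), which the abstract does not specify.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag


def train_hmm(tagged_sentences):
    """Count tag unigrams, tag bigrams, and (tag, word) emissions."""
    tag_counts = Counter()
    trans_counts = defaultdict(Counter)  # previous tag -> next tag -> count
    emit_counts = defaultdict(Counter)   # tag -> word -> count
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            tag_counts[tag] += 1
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            vocab.add(word)
            prev = tag
    return tag_counts, trans_counts, emit_counts, vocab


def transition_prob(model, prev, tag, lam=0.8):
    """Interpolate the bigram estimate P(tag | prev) with the unigram P(tag)."""
    tag_counts, trans_counts, _, _ = model
    total_prev = sum(trans_counts[prev].values())
    bigram = trans_counts[prev][tag] / total_prev if total_prev else 0.0
    unigram = tag_counts[tag] / sum(tag_counts.values())
    return lam * bigram + (1.0 - lam) * unigram


def emission_prob(model, tag, word, mu=0.99):
    """Interpolate P(word | tag) with a uniform distribution so that
    unseen words never receive probability zero."""
    tag_counts, _, emit_counts, vocab = model
    relative_freq = emit_counts[tag][word] / tag_counts[tag]
    uniform = 1.0 / (len(vocab) + 1)  # +1 reserves a slot for unseen words
    return mu * relative_freq + (1.0 - mu) * uniform


def viterbi(model, words, lam=0.8, mu=0.99):
    """Return the most probable tag sequence for `words` (log-space Viterbi)."""
    if not words:
        return []
    tags = sorted(model[0])
    scores = [{
        t: math.log(transition_prob(model, START, t, lam))
           + math.log(emission_prob(model, t, words[0], mu))
        for t in tags
    }]
    backpointers = [{}]
    for word in words[1:]:
        column, pointers = {}, {}
        for t in tags:
            best_prev = max(
                tags,
                key=lambda p: scores[-1][p]
                + math.log(transition_prob(model, p, t, lam)),
            )
            column[t] = (scores[-1][best_prev]
                         + math.log(transition_prob(model, best_prev, t, lam))
                         + math.log(emission_prob(model, t, word, mu)))
            pointers[t] = best_prev
        scores.append(column)
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: scores[-1][t])]
    for pointers in reversed(backpointers[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy stand-in data; a real system would train on a tagged Mongolian corpus.
toy_corpus = [
    [("the", "DET"), ("dog", "N"), ("barks", "V")],
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("a", "DET"), ("dog", "N"), ("sleeps", "V")],
]
model = train_hmm(toy_corpus)
print(viterbi(model, ["the", "bird", "sleeps"]))
```

In this sketch the word "bird" never occurs in the toy corpus, so its emission probability is the same small value for every tag and the interpolated transition probabilities alone decide its tag. That is the effect smoothing is meant to have on sparse data.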
Keywords/Search Tags: Mongolian Part-of-Speech Tagging, Statistical Method, Hidden Markov Model, Viterbi Algorithm