| Tibetan part-of-speech (POS) tagging is a fundamental task in Tibetan language information processing. Its results not only lay the foundation for research on machine translation systems, search engines, network information security, and many other areas, but are also an essential prerequisite for the subsequent syntactic analysis, semantic analysis, and text analysis of Tibetan. The study of Tibetan POS tagging is an important part of natural language understanding. Therefore, the research and implementation of a Tibetan POS tagging system is of great theoretical significance and practical value. This paper first describes the significance and purpose of Tibetan POS tagging and reviews POS tagging research at home and abroad. As the basis for Tibetan POS tagging, it then studies common Tibetan word segmentation methods, together with the ambiguity and unknown-word recognition problems in word processing, and proposes methods such as "verb first cut" and the "split + word combination method" to eliminate the mixed ambiguity problem in Tibetan word segmentation. Using Tibetan "affix incorporation", "word fragments consolidation", and the "POS information revised segmentation method", it addresses the identification of unknown words in Tibetan word segmentation and thereby improves segmentation accuracy. 
On this basis, it studies the construction of a Tibetan POS knowledge base and a Tibetan corpus; finally, it combines rule-based and statistical methods to design and implement a Tibetan POS tagging system. To build the system, more than 90,000 entries from common Tibetan dictionaries, such as the Tibetan-Chinese dictionary, the newly edited Tibetan dictionary, and the dictionary of Tibetan verbs, were merged, weighed, filtered, and organized; more than 70,000 of these entries were then tagged with POS to establish the POS knowledge base. Using Tibetan literature, folklore, history, and primary-school Tibetan language textbooks as material, more than 120,000 selected corpus items were manually POS-tagged. With this corpus as the source of statistical information, a hidden Markov model (HMM) is trained to obtain the lexical probabilities and POS transition probabilities that make up the language model. This paper applies a simple and effective "given minimum value" smoothing algorithm to solve the data sparseness problem in the statistics, which effectively avoids the loss of accuracy that sparseness would otherwise cause. Finally, it applies the Viterbi algorithm to select the optimal POS tag sequence. This paper not only systematically studies the theory of Tibetan word segmentation and POS tagging, but also addresses segmentation ambiguity and the identification of unknown words in Tibetan word segmentation; it establishes a POS knowledge base and a manually annotated corpus, resolves Tibetan POS tagging and word processing, and implements a Tibetan POS tagging system in software. In testing, the system achieves a POS tagging accuracy of 89.56% on the open test corpus and 95.09% on the closed test corpus. |
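The statistical part of the pipeline described above (HMM training on a tagged corpus, minimum-value smoothing, Viterbi decoding) can be illustrated with a minimal Python sketch. It is not the system's actual implementation. The function names `train`, `logp`, and `viterbi`, the floor constant `FLOOR = 1e-6`, and the reading of "given minimum value" smoothing as replacing any unseen probability with a fixed small value are all assumptions made for illustration. The rule-based component and the knowledge-base lookup are omitted.

```python
import math
from collections import Counter, defaultdict

# Assumed minimum probability given to events never seen in training.
FLOOR = 1e-6

def train(tagged_sentences):
    """Estimate initial, transition and emission probabilities by relative frequency."""
    init = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sent in tagged_sentences:
        prev = None
        for word, tag in sent:
            emit[tag][word] += 1
            if prev is None:
                init[tag] += 1
            else:
                trans[prev][tag] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (normalize(init),
            {t: normalize(c) for t, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

def logp(table, key):
    """Log probability with the minimum-value floor applied to unseen events."""
    return math.log(max(table.get(key, 0.0), FLOOR))

def viterbi(words, model):
    """Return the most probable tag sequence for `words` under the HMM."""
    if not words:
        return []
    init, trans, emit = model
    tags = sorted(emit)
    # best[t] = (log score of the best path ending in tag t, that path)
    best = {t: (logp(init, t) + logp(emit[t], words[0]), [t]) for t in tags}
    for word in words[1:]:
        new = {}
        for t in tags:
            # Pick the best predecessor tag p for current tag t.
            score, path = max(
                ((best[p][0] + logp(trans.get(p, {}), t), best[p][1]) for p in tags),
                key=lambda x: x[0])
            new[t] = (score + logp(emit[t], word), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

# Toy corpus with placeholder English tokens and two hypothetical tags.
corpus = [
    [("dog", "N"), ("runs", "V")],
    [("cat", "N"), ("sleeps", "V")],
    [("dog", "N"), ("sleeps", "V")],
]
model = train(corpus)
tags_known = viterbi(["cat", "runs"], model)
tags_unknown = viterbi(["bird", "runs"], model)  # "bird" is unseen in training
```

Because the floor is applied without renormalizing, the smoothed values no longer sum exactly to one. That is acceptable for choosing the best path, since Viterbi only compares scores, but it would matter if calibrated probabilities were needed.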