Research And Implementation Of The Tibetan Part Of Speech Tagging System

Posted on:2013-10-23

Degree:Master

Type:Thesis

Country:China

Candidate:M Z M Yang

Full Text:PDF

GTID:2235330362963355

Subject:Chinese Ethnic Language and Literature

Abstract/Summary:

Tibetan part-of-speech (POS) tagging is a basic task in Tibetan language information processing. Its results lay the foundation for research on machine translation systems, search engines, network information security, and many other areas, and they are an essential prerequisite for the subsequent syntactic, semantic, and text analysis of Tibetan. The study of Tibetan POS tagging is an important part of natural language understanding. The research and implementation of a Tibetan POS tagging system is therefore of great theoretical significance and practical value.

This thesis first describes the significance and purpose of Tibetan POS tagging and reviews POS tagging research at home and abroad. As the basis for Tibetan POS tagging, it then studies common Tibetan word segmentation methods, together with the problems of ambiguity and unknown-word recognition in word segmentation, and proposes methods such as "verb first cut" and "split + word combination" to eliminate mixed ambiguity in Tibetan word segmentation. Through Tibetan "affix incorporation", "word fragment consolidation", and a "POS-information revised segmentation method", it addresses the identification of unknown words and thereby improves the accuracy of Tibetan word segmentation. On this basis, it studies the construction of a Tibetan POS knowledge base and a Tibetan corpus. Finally, it combines rule-based and statistical methods to design and implement a Tibetan POS tagging system.

To build the system, more than 90,000 entries from common Tibetan dictionaries, such as the Tibetan-Chinese dictionary, the newly edited Tibetan dictionary, and the dictionary of Tibetan verbs, were merged, weighed, filtered, and organized; more than 70,000 of these entries were POS-tagged to establish the POS knowledge base. Using Tibetan literature, folklore, history, and primary-school Tibetan language textbooks as materials, more than 120,000 selected corpus items were manually POS-tagged. This corpus serves as the source of statistical data: a hidden Markov model (HMM) is trained on it to obtain the lexical probabilities and POS transition probabilities that make up the language model. The thesis applies a simple and effective smoothing algorithm, which assigns a given minimum value, to solve the sparse-data problem in the statistics; this effectively avoids the loss of accuracy that data sparseness would otherwise cause. Finally, it applies the Viterbi algorithm to select the optimal POS tag sequence.

This thesis not only systematically studies the theory of Tibetan word segmentation and POS tagging, but also addresses ambiguity and the identification of unknown words in Tibetan word segmentation, establishes a POS knowledge base and a manually annotated corpus, and implements a Tibetan POS tagging system in software. In testing, the system achieves an accuracy of 89.56% on the open test corpus and 95.09% on the closed test corpus.
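
The statistical stage named in the abstract (HMM training, minimum-value smoothing, Viterbi decoding) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the thesis's implementation: the floor value MIN_PROB, the function names, and the toy English corpus are all hypothetical, and the abstract does not say how its minimum value is chosen or whether probabilities are renormalized after smoothing.

```python
import math
from collections import Counter, defaultdict

# Assumed floor probability for unseen events; the thesis does not state its value.
MIN_PROB = 1e-6

def train_hmm(tagged_sentences):
    """Collect start, transition, and lexical counts from (word, tag) sentences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return start, trans, emit

def prob(counter, key):
    """Relative frequency, replaced by MIN_PROB when the event was never seen."""
    total = sum(counter.values())
    if total == 0 or counter[key] == 0:
        return MIN_PROB
    return counter[key] / total

def viterbi(words, model):
    """Return the most probable tag sequence for words under the model."""
    if not words:
        return []
    start, trans, emit = model
    tags = sorted(emit)
    # score[t]: best log-probability of any tag path that ends in tag t
    score = {t: math.log(prob(start, t)) + math.log(prob(emit[t], words[0]))
             for t in tags}
    backpointers = []
    for word in words[1:]:
        new_score, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: score[p] + math.log(prob(trans[p], t)))
            new_score[t] = (score[best_prev]
                            + math.log(prob(trans[best_prev], t))
                            + math.log(prob(emit[t], word)))
            pointers[t] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy stand-in for a manually tagged corpus (hypothetical data, not from the thesis).
toy_corpus = [
    [("dog", "N"), ("runs", "V")],
    [("cat", "N"), ("sleeps", "V")],
    [("dog", "N"), ("sleeps", "V")],
]
model = train_hmm(toy_corpus)
print(viterbi(["cat", "runs"], model))  # ['N', 'V']
```

Working in log space avoids numeric underflow on long sentences, and the floor keeps every path's score finite, so a word or tag transition that never occurred in training cannot zero out an otherwise likely sequence.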

Keywords/Search Tags:

Tibetan language information processing, Tibetan word segmentation, Tibetan POS tagging

Related items

1	Tibetan Segmentation And POS Tagging Study
2	Research On Tibetan Word Segmentation And Part-of-speech Tagging Based On Pre-trained Language Models
3	Research On Tibetan Word Segmentation And Part-of-speech Tagging Based On GNN
4	Text Analysis Of Speech Synthesis Based On Statistical Parameters Of Tibetan Language In Specific Fields
5	Research On Syntactic Analysis Based On Tibetan Dependency Tree Enhancement
6	The Research On Tibetan Automatic Word Segmentation Technology
7	Research On Word Segmentation And Part-of-speech Of Tibetan On Neural Network
8	A Study Of The Information Processing In Tibetan Proverbs Corpus Building
9	Tibetan Newspaper Yul Phyogs So So ’i Gsar’ Gyur Me Lons As New Impetus In Tibetan Literary Writing And Modernization Of Tibetan Language
10	Research On Automatic Notation Of Word For Tibetan Corpus Based On HMM