Tibetan Automatic Word Segmentation And Part-of-speech Tagging Research

Posted on: 2017-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: S G D Luo
GTID: 2358330485467437
Subject: Computer Science and Technology
Abstract/Summary:
As research on language information processing has deepened, Tibetan information processing has shifted from word processing to language information processing. Tibetan lexical and syntactic analysis is the primary task underlying Tibetan semantic understanding, information retrieval, and machine translation. Many domestic research institutions and scholars in Chinese natural language processing have developed relatively mature systems, such as LTP from Harbin Institute of Technology and FudanNLP from Fudan University, which have advanced Chinese natural language processing. The foundation of Tibetan language information processing research, however, is weak: although many research articles have been published, few public systems are available.

This paper presents Tibetan word segmentation and part-of-speech (POS) tagging based on the Conditional Random Fields (CRFs) model, addresses issues specific to Tibetan, and summarizes rules to perform automatic correction. A web-based Tibetan word segmentation and POS tagging system is designed, which automatically collects large-scale Tibetan text data, converts it to XML format, and performs word segmentation and POS tagging. This system lays a foundation for applications of intelligent information processing. The three main contributions and innovations are as follows:

1. A Tibetan word segmentation and POS tagging model based on large-scale corpus construction. The text corpus was collected from ten Tibetan websites, including People's Daily (Tibetan channel), Xinhua Net (Tibetan channel), Tibet's news, and Qinghai Tibetan radio network. It covers the news, entertainment, poetry, culture, and religion domains. A 35.1M text corpus (1 million words) and a 78.5M text corpus (398 million words) were built for the word segmentation model, which constitutes a large-scale corpus. Experiments were carried out to analyze the influence of each feature selected for model construction, in order to determine a good feature set by screening and obtain well-performing Tibetan word segmentation and POS tagging results.

2. Word segmentation and POS tagging based on knowledge fusion. The statistical word segmentation and POS tagging model performs well in segmentation disambiguation and unknown word discovery, but segmentation and labeling errors still occur. This paper analyzes the output of the statistical model in detail, classifies the error types, and then establishes rules that automatically revise those errors to produce the final word segmentation and POS tagging results. Open-set experiments show that the accuracy, recall, and F value of word segmentation are 96.11%, 96.03%, and 96.06% respectively, and the accuracy of POS tagging is 98.75%, which basically meets practical requirements.

3. An open, practical word segmentation system. Tibetan information processing has a weak research basis, no publicly available software, and little public text corpus. This paper designs and implements a web-based Tibetan word segmentation and POS tagging system that is convenient and practical, and whose performance is improved continually to promote the development of Tibetan information processing.
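The segmentation scores reported above (accuracy, recall, F value) are normally computed by comparing predicted word boundaries against a gold-standard segmentation. The following is a minimal sketch of that calculation, not the thesis's own evaluation code. It assumes that "accuracy" for segmentation means precision over predicted words, that a word counts as correct only when both of its boundaries match the gold segmentation, and that F is the harmonic mean of the two. The function names and the Latin placeholder strings are hypothetical.

```python
def word_spans(words):
    """Map a list of words to a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F) for one segmented sentence.

    Assumption: both segmentations cover the same underlying text, and a
    predicted word is correct only if both boundaries match a gold word.
    """
    gold = word_spans(gold_words)
    pred = word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Placeholder text "abcde": gold is ab|c|de, the prediction is ab|cd|e.
# Only "ab" has both boundaries right, so one of three words is correct.
precision, recall, f_value = segmentation_prf(["ab", "c", "de"], ["ab", "cd", "e"])
print(f"P={precision:.4f} R={recall:.4f} F={f_value:.4f}")
```

For a corpus-level score, the correct, predicted, and gold word counts would usually be summed over all sentences before dividing, rather than averaging per-sentence values.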
Keywords/Search Tags:Tibetan, Word segmentation, POS, CRFs, Knowledge fusion