Tibetan Automatic Word Segmentation And Part-of-speech Tagging Research

Posted on:2017-01-16

Degree:Master

Type:Thesis

Country:China

Candidate:S G D Luo

Full Text:PDF

GTID:2358330485467437

Subject:Computer Science and Technology

Abstract/Summary:


As research on language information processing has deepened, Tibetan information processing has shifted from word processing to language information processing. Tibetan lexical analysis and syntactic analysis are the primary tasks underlying Tibetan semantic understanding, information retrieval, and machine translation. Many research institutions and scholars in China have developed relatively mature systems for Chinese natural language processing, such as LTP from Harbin Institute of Technology and FudanNLP from Fudan University, which have advanced the field. The foundation of Tibetan language information processing research, however, is weak: although many research articles have been published, few publicly available systems exist.

This thesis presents Tibetan word segmentation and part-of-speech (POS) tagging based on a Conditional Random Fields (CRFs) model, addresses issues specific to Tibetan, and summarizes rules for automatic error correction. A web-based Tibetan word segmentation and POS tagging system is designed that automatically collects large-scale Tibetan text data, converts it to XML format, and performs word segmentation and POS tagging. The system lays a foundation for intelligent information processing applications.

The three main contributions and innovations are as follows:

1. A Tibetan word segmentation and POS tagging model built on a large-scale corpus. Text was collected from ten Tibetan websites, including People's Daily (Tibetan channel), Xinhua Net (Tibetan channel), Tibet's news, and the Qinghai Tibetan radio network. The corpus covers the news, entertainment, poetry, culture, and religion domains. A 35.1M text corpus (1 million words) and a 78.5M text corpus (398 million words) were built for the word segmentation model, which constitutes a large-scale corpus. Experiments were carried out to analyze the influence of each feature selected for model construction, and the features were screened to determine a good feature set and obtain well-performing Tibetan word segmentation and POS tagging results.

2. Word segmentation and POS tagging based on knowledge fusion. The statistical model performs well at segmentation disambiguation and unknown-word discovery, but segmentation and tagging errors remain. The thesis analyzes the output of the statistical model in detail, categorizes the error types, and establishes rules that automatically correct those errors to produce the final word segmentation and POS tagging results. Open-set experiments show that the accuracy, recall, and F value of word segmentation are 96.11%, 96.03%, and 96.06%, and that the accuracy of POS tagging is 98.75%, which basically meets practical requirements.

3. An open, practical word segmentation system. Tibetan information processing has a weak research basis, no publicly available software, and few public text corpora. The thesis designs and implements a web-based Tibetan word segmentation and POS tagging system that is convenient and practical, and whose performance is improved continually to promote the development of Tibetan information processing.
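The abstract does not say how the CRF labelling task is encoded or how the segmentation scores are computed. The sketch below assumes a common formulation: each syllable (delimited by the tsheg, U+0F0B) receives a B/M/E/S position tag, and accuracy, recall, and F value are computed over word spans. The function names, the tag set, and the sample input are hypothetical and are not taken from the thesis.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)


def split_syllables(word):
    """Split a word into syllables on the tsheg, dropping empty pieces."""
    return [s for s in word.split(TSHEG) if s]


def words_to_tags(words):
    """Turn segmented words into (syllable, tag) pairs.

    Assumed tag set: B = word-initial, M = word-internal,
    E = word-final, S = single-syllable word.
    """
    pairs = []
    for word in words:
        syls = split_syllables(word)
        if len(syls) == 1:
            pairs.append((syls[0], "S"))
            continue
        for i, syl in enumerate(syls):
            tag = "B" if i == 0 else "E" if i == len(syls) - 1 else "M"
            pairs.append((syl, tag))
    return pairs


def tags_to_words(pairs):
    """Rebuild words (lists of syllables) from (syllable, tag) pairs."""
    words, current = [], []
    for syl, tag in pairs:
        if tag in ("B", "S") and current:  # a new word starts: flush the old one
            words.append(current)
            current = []
        current.append(syl)
        if tag in ("E", "S"):  # the word ends on this syllable
            words.append(current)
            current = []
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


def word_spans(words):
    """Map words (lists of syllables) to a set of (start, end) syllable spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f(gold_words, pred_words):
    """Score a predicted segmentation against a gold one over word spans."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Illustrative input only: three words of 2, 1, and 2 syllables.
gold = [TSHEG.join(["བོད", "ཡིག"]), "གི", TSHEG.join(["སློབ", "གྲྭ"])]
pairs = words_to_tags(gold)
gold_words = tags_to_words(pairs)

# A hypothetical prediction that wrongly splits the first word.
pred_words = [["བོད"], ["ཡིག"], ["གི"], ["སློབ", "གྲྭ"]]
p, r, f = precision_recall_f(gold_words, pred_words)
```

In this formulation a CRF would be trained on the (syllable, tag) pairs plus context features, and a rule-based correction pass like the one in contribution 2 could run on the decoded words before scoring.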

Keywords/Search Tags:

Tibetan, Word segmentation, POS, CRFs, Knowledge fusion


Related items

1	Research On The Specification Of Chinese Word Segmentation Designed For Special Domain
2	Research On Dependency Parsing Of Tibetan Language Based On Deep Learning
3	Research Of Chinese Word Segmentation With Conditional Random Fields And Implementation
4	Research On Automatic Disambiguation Method Of Tibetan Word Meaning Based On Chinese And Tibetan Parallel Corpus
5	A Study On Cambodian Word Method Based On Conditional Random Field
6	Study On The Tibetan Word Segmentation And Named Entity Recognition With Conditional Random Fields
7	A Study Of Burma's Lexical Methods
8	Based On The Same Field Crfs And Interdisciplinary Under Brand Word Extraction
9	Research On Tibetan-Chinese Neural Machine Translation Incorporating Prior Knowledge
10	The Method Of The Vietnamese Lexical Analysis Research