Application Of Conditional Random Fields In Mongolian Word Segmentation

Posted on:2010-01-18

Degree:Master

Type:Thesis

Country:China

Candidate:W Zhao

Full Text:PDF

GTID:2178360278967870

Subject:Computer application technology

Abstract/Summary:

Mongolian word segmentation is an essential task in Mongolian information processing. Its goal is to have a computer automatically identify the stems and affixes that make up a Mongolian word. Stems and affixes carry a large amount of grammatical information, and using this information can improve the performance of Mongolian machine translation, information extraction, information retrieval, and related applications. In recent years, researchers have carried out preliminary work on Mongolian word segmentation with some success: the dictionary and rule-based segmentation method reaches an accuracy of 0.86, and the statistical segmentation method based on SK.IP-N reaches an accuracy of 0.939. On the whole, however, research on Mongolian word segmentation began rather late and remains shallow, and segmentation accuracy does not yet meet practical requirements.

This thesis describes the status and significance of Mongolian word segmentation research and gives a comparative analysis of the existing segmentation methods. Although they differ in implementation, the existing methods all depend heavily on manually constructed sets of segmentation rules. Building on a Mongolian corpus and on methods from statistical linguistics, this study for the first time treats Mongolian word segmentation as a sequence labeling problem instead of relying on hand-written segmentation rules.

The thesis introduces the theory of probabilistic graphical models as it relates to sequence labeling and compares several models commonly used for that problem. It points out that the conditional random fields model is a probabilistic graphical model that can represent overlapping features and avoids the label bias problem. Based on an analysis of how Mongolian words are formed, a new tag set is proposed that uses different tags for stems and for affixes. To take advantage of word-level context information, a sentence-based training model is also proposed.

The experimental results show that the tag set that distinguishes stems from affixes achieves higher accuracy and better scores on the other evaluation metrics than the tag set that does not. The results also show that word-level context information helps improve segmentation accuracy: the word-based training model reaches a word segmentation accuracy of 0.988, and the sentence-based training model reaches 0.991.
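To make the sequence labeling formulation concrete, the following is a minimal sketch of how a segmented word could be turned into per-character labels and back. It is an illustration under stated assumptions, not the thesis's actual scheme: the tag names (SB, SI, AB, AI), the function names, and the placeholder word are all hypothetical, since the abstract does not list the real tag set or give example words.

```python
from typing import List, Tuple

# Hypothetical tag names (the abstract does not specify the real tag set):
#   SB / SI = first / later character of a stem
#   AB / AI = first / later character of an affix


def encode(stem: str, affixes: List[str]) -> List[Tuple[str, str]]:
    """Label every character of a segmented word with a stem or affix tag."""
    labeled: List[Tuple[str, str]] = []
    morphemes = [("S", stem)] + [("A", affix) for affix in affixes]
    for kind, morpheme in morphemes:
        for i, ch in enumerate(morpheme):
            labeled.append((ch, kind + ("B" if i == 0 else "I")))
    return labeled


def decode(labeled: List[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Rebuild (stem, affixes) from a tagged character sequence."""
    stem = ""
    affixes: List[str] = []
    for ch, tag in labeled:
        if tag in ("SB", "SI"):
            stem += ch
        elif tag == "AB":
            affixes.append(ch)  # a new affix starts here
        elif tag == "AI":
            if not affixes:
                raise ValueError("AI tag without a preceding AB tag")
            affixes[-1] += ch   # continue the current affix
        else:
            raise ValueError(f"unknown tag: {tag}")
    return stem, affixes


# Placeholder strings, not real Mongolian: stem "abc" plus affixes "de" and "f".
tags = encode("abc", ["de", "f"])
print(tags)
print(decode(tags))
```

In a setup like this, a sequence labeler such as a conditional random field would be trained to predict the tag column from the characters, and decoding the predicted tags recovers the segmentation. Because stems and affixes carry different tags, the decoded output also says which segment is the stem.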

Keywords/Search Tags:

Mongolian, word segmentation, stem, suffix, conditional random fields, sequential labeling

Related items

1	Research Of Chinese Word Segmentation With Conditional Random Fields
2	Research And System Implementation Of Chinese Word Segmentation In Specialized Fields Based On Conditional Random Fields
3	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
4	Research And Implementation Of Chinese Segmentation System Based On Conditional Random Fields Model
5	The Research On Chinese Word Segmentation Based On Conditional Random Fields In Big Data Environment
6	Research Of Named Entity Recognition Based On Conditional Random Fields
7	Study On The Tibetan Word Segmentation And Named Entity Recognition With Conditional Random Fields
8	The Research Of Chinese Word Segmentation Based On CRF
9	The Research Of Applying Conditional Random Fields To Chinese Lexical Analysis And Chunk Parsing
10	Text Categorization Based On The Conditional Random Fields