
Application Of Conditional Random Fields In Mongolian Word Segmentation

Posted on: 2010-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: W Zhao
Full Text: PDF
GTID: 2178360278967870
Subject: Computer application technology
Abstract/Summary:
Mongolian word segmentation is a fundamental task in Mongolian information processing: it automatically identifies the stems and affixes that make up a Mongolian word. Stems and affixes carry a large amount of grammatical information, and using this information helps improve the performance of Mongolian machine translation, information extraction, information retrieval, and related applications. In recent years, researchers have carried out preliminary work on Mongolian word segmentation with some success: the dictionary- and rule-based segmentation method achieves an accuracy of 0.86, and the segmentation method based on the SK.IP-N statistical method achieves an accuracy of 0.939. On the whole, however, research on Mongolian word segmentation began rather late and remains shallow, and segmentation accuracy does not yet meet practical requirements.
This thesis describes the status and significance of Mongolian word segmentation research and compares the existing segmentation methods. Although they differ in implementation, the existing methods all depend heavily on manually constructed segmentation rule sets. Building on a Mongolian corpus and statistical linguistic methods, this study for the first time treats Mongolian word segmentation as a sequential labeling problem rather than relying on hand-written segmentation rules.
The thesis introduces the probabilistic graphical model theory related to sequential labeling and compares several models commonly used for it, pointing out that the conditional random field (CRF) model is a probabilistic graphical model that can represent overlapping features and avoids the label bias problem. Based on an analysis of the structure of Mongolian words, it proposes a new tag set that uses different tags for stems and affixes. To take advantage of word-level context information, it also proposes a sentence-based training model.
The experimental results show that the tag set that distinguishes stems from affixes achieves higher accuracy and better scores on other evaluation metrics than the tag set that does not. They also show that word-level context information helps improve segmentation accuracy: the word-based training model reaches a word segmentation accuracy of 0.988, and the sentence-based training model reaches 0.991.
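As a rough illustration of what treating segmentation as sequential labeling with separate stem and affix tags could look like, the sketch below labels each character of a word with a position marker (B for the first character of a morpheme, I for the rest) and a morpheme type (S for stem, A for affix), then rebuilds the segmentation from a tag sequence. The tag names `B-S`, `I-S`, `B-A`, `I-A`, the function names, and the placeholder strings are all assumptions made for this example; the abstract does not state the actual tag set used in the thesis, and the strings are not real Mongolian words.

```python
from typing import List, Tuple


def tag_morphemes(stem: str, affixes: List[str]) -> List[Tuple[str, str]]:
    """Label every character with a position (B/I) and a morpheme type (S/A).

    The tag scheme is hypothetical; the thesis's own tag set is not given.
    """
    tagged = []
    for kind, morpheme in [("S", stem)] + [("A", a) for a in affixes]:
        for i, ch in enumerate(morpheme):
            tagged.append((ch, ("B-" if i == 0 else "I-") + kind))
    return tagged


def segment_from_tags(chars: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Rebuild (morpheme, type) pairs from a per-character tag sequence."""
    morphemes: List[Tuple[str, str]] = []
    for ch, tag in zip(chars, tags):
        position, kind = tag.split("-")
        if position == "B" or not morphemes:
            # A B tag (or the very first character) opens a new morpheme.
            morphemes.append((ch, kind))
        else:
            # An I tag extends the morpheme that is currently open.
            text, first_kind = morphemes[-1]
            morphemes[-1] = (text + ch, first_kind)
    return morphemes


# Placeholder word: stem "abc" followed by two affixes "de" and "f".
tagged = tag_morphemes("abc", ["de", "f"])
print(tagged)
print(segment_from_tags("abcdef", [tag for _, tag in tagged]))
```

In a setup like this, a sequence model such as a CRF would be trained to predict the tag column from the character column, and `segment_from_tags` would turn its predictions back into stems and affixes. A tag set that does not distinguish stems from affixes would keep only the B/I part.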
Keywords/Search Tags: Mongolian, word segmentation, stem, suffix, conditional random fields, sequential labeling