
Research On The Classified Method Of Inconsistency Of Segmentation For Chinese Corpus

Posted on: 2007-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Miao
Full Text: PDF
GTID: 2178360185951004
Subject: Computer application technology
Abstract/Summary:
A Chinese corpus is an essential language resource. Without the support of a corpus, Chinese information processing can make little progress. Corpora are of great practical value in natural language processing. Because a corpus records a wide range of phenomena from real-life language use, it lets people observe and grasp the facts of language and study and analyze its rules. A corpus also provides full and accurate information for acquiring language knowledge, building language models, and researching information processing technology.

Building a high-quality, large-scale corpus is an essential project, since it underpins all other corpus-based research results. So far, no fully satisfactory large-scale Chinese corpus has been produced anywhere in the world. The key problem is corpus quality, and one important measure of that quality is whether the corpus is highly consistent in its word segmentation.

In building a Chinese corpus, especially a large-scale one, segmentation inconsistencies are inevitable. Research to date has focused mainly on word segmentation itself, and many methods and algorithms have been proposed; research on segmentation inconsistency, however, has seldom been reported. To improve the quality of Chinese corpora, this thesis studies segmentation inconsistency and makes the following contributions:

1. Based on statistics and analysis of the segmentation inconsistencies in a Chinese corpus of 1.5 million Chinese characters, the main structural types of segmentation inconsistency are defined.
2. A rule-based classification method is proposed. Nineteen rules for classifying and tagging inconsistent segmentations were induced manually. These rules are effective for 50% of the classification.
3. Statistical methods, namely pointwise mutual information and the t-test, are used for classification in combination with a nearest-neighbor strategy. From an experiment on data containing 1 million words, a probability value and a feature vector were obtained for classification and tagging.
4. A test was carried out on the corpus. The results show that the methods proposed in the thesis are effective.

In the course of this work, we found it hard to find a single method that resolves the problem, so a combined method is used for classification. With this classification, inconsistent segmentations that have the same structure and similar function are grouped together and can be handled with a uniform segmentation format. The goal of the thesis is to mark inconsistent segmentations with a class tag and a suggested tag. In the open test, a CP of 76% was obtained.
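The statistical step named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes the textbook collocation definitions of pointwise mutual information and the t-score for an adjacent pair, and a plain Euclidean nearest-neighbor lookup over the resulting two-dimensional feature vector. The function names, the toy token sequence, and the labels `class_A` and `class_B` are all hypothetical.

```python
import math
from collections import Counter


def association_scores(tokens, x, y):
    """Return (PMI, t-score) for the adjacent pair (x, y) in a token list.

    Assumed textbook definitions, with N = len(tokens):
      PMI = log2(P(xy) / (P(x) * P(y)))
      t   = (P(xy) - P(x) * P(y)) / sqrt(P(xy) / N)
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    c_xy = bigrams[(x, y)]
    if c_xy == 0:
        raise ValueError("pair never occurs adjacently in the data")
    p_x = unigrams[x] / n
    p_y = unigrams[y] / n
    p_xy = c_xy / n
    pmi = math.log2(p_xy / (p_x * p_y))
    t_score = (p_xy - p_x * p_y) / math.sqrt(p_xy / n)
    return pmi, t_score


def nearest_label(vector, labelled_vectors):
    """Return the label of the labelled vector closest to `vector`.

    `labelled_vectors` is a list of (vector, label) pairs; distance is
    Euclidean. The labels themselves are hypothetical placeholders.
    """
    closest = min(labelled_vectors, key=lambda item: math.dist(vector, item[0]))
    return closest[1]


if __name__ == "__main__":
    # Toy data: each character stands in for one token of a corpus.
    toy_tokens = list("abababcc")
    features = association_scores(toy_tokens, "a", "b")
    examples = [((0.0, 0.0), "class_A"), ((1.5, 1.0), "class_B")]
    print(features, nearest_label(features, examples))
```

In this sketch the pair's (PMI, t-score) values serve as the feature vector, and the nearest labelled example supplies the suggested class; how the thesis actually builds its probability value and feature vector is not stated in the abstract.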
Keywords/Search Tags:Chinese information processing, corpus, word segmentation, inconsistency of segmentation