
Cascade Consistency Check Of Segmentation Of The Chinese Corpus

Posted on: 2009-09-14
Degree: Master
Type: Thesis
Country: China
Candidate: B Liu
Full Text: PDF
GTID: 2178360272463565
Subject: Computer software and theory
Abstract/Summary:
In Chinese natural language processing, much research focuses on acquiring knowledge automatically from large-scale corpora of real text, so building large, high-quality corpora has become a primary task. Because corpora are currently built with manual checking, unavoidable carelessness and mistakes cause the same word to be segmented inconsistently in the same linguistic context. These inconsistencies not only lower the precision of the corpus but also propagate errors to later processing steps that use it. Segmentation consistency must therefore be checked and corrected while the corpus is processed, and it is an important criterion for evaluating the quality of a segmented corpus. To address this problem in large-scale corpora, and based on a study of inconsistent strings in the SXU and MSR corpora, we propose a cascaded method: segmentation inconsistencies are first handled with a rule database, and the corpus is then corrected with a statistical model. The experiments achieved the expected goal, showing that the combined method can resolve segmentation inconsistencies effectively. The main work is as follows:

1. Based on statistics and analysis of segmentation inconsistencies in a Chinese corpus of 4 million characters, we define the main structural types of segmentation inconsistency, determine the research object of the thesis, and use the factors that influence precision as the foundation of the rule database.
2. We put forward a method based on rules and examples. We extract initial rules and a large number of examples, which are applied to correct the segmentation results, and we improve the quality of the segmented corpus through rule self-learning.
3. We propose a statistical method that can greatly improve the quality of the segmented corpus. Extracted inconsistent strings and their word contexts are represented with a vector model, and a synonym database is used when computing similarity. Similarity computation and classification are used to obtain a probability value for each inconsistent string, and the strings are finally classified according to this quantitative measure.
4. Based on the above ideas and methods, we design three experimental models: a rule-based method, a statistical method, and a method that combines the two. Tests on the corpus show that the methods proposed in the thesis are effective.

In the process, we found that it is hard to find a single method that resolves this problem, so we use a combined method for classification. With classification, inconsistent segmentations that have the same structure and similar functions are grouped together and can be given a uniform segmentation format. The SXU corpus achieved good results in the SIGHAN 2007 bakeoff. In open tests of the combined segmentation consistency checking system, the precision of consistency checking is 84.50% and the recall is 70.39%, indicating that the system can improve corpus quality.
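
The abstract does not give implementation details, so the following Python sketch is only an illustration of two ideas it mentions: collecting strings that receive more than one segmentation in a corpus, and comparing the word contexts of two occurrences with a vector model. The function names, the context window size, and the use of plain bag-of-words cosine similarity are all assumptions made for this example; the thesis's rule database and synonym database are not modelled here.

```python
from collections import Counter, defaultdict
import math


def find_inconsistent_strings(sentences, max_words=3):
    """Return {string: set of segmentations} for strings segmented in more than one way.

    `sentences` is a list of segmented sentences, each a list of words.
    Only spans of up to `max_words` consecutive words are considered
    (an assumed limit, chosen to keep the sketch small).
    """
    variants = defaultdict(set)
    for words in sentences:
        for start in range(len(words)):
            for n in range(1, max_words + 1):
                if start + n > len(words):
                    break
                chunk = tuple(words[start:start + n])
                variants["".join(chunk)].add(chunk)
    # Keep only the strings that were segmented in at least two different ways.
    return {s: segs for s, segs in variants.items() if len(segs) > 1}


def context_vector(words, start, length, window=2):
    """Bag-of-words vector of the words around words[start:start + length]."""
    left = words[max(0, start - window):start]
    right = words[start + length:start + length + window]
    return Counter(left + right)


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

In this sketch, a string such as 中国人 that appears once as a single word and once as 中国 + 人 would be reported by `find_inconsistent_strings` as a candidate. A high `cosine` score between the contexts of the two occurrences would suggest that they occur in similar environments and so may deserve the same segmentation; how such scores are turned into probabilities and classes is specific to the thesis and is not reproduced here.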
Keywords/Search Tags:Corpus, Consistency Verify, Inconsistency of Segmentation, Word Segmentation