Research On The Classified Method Of Inconsistency Of Segmentation For Chinese Corpus

Posted on:2007-02-02

Degree:Master

Type:Thesis

Country:China

Candidate:X Miao

GTID:2178360185951004

Subject:Computer application technology

Abstract/Summary:

A Chinese corpus is an essential language resource. Without corpus support, Chinese information processing can make little progress. Corpora are of great practical value in natural language processing: a corpus records many phenomena that occur in real-life language, so people can observe and grasp linguistic facts and study and analyze the rules of the language. Corpora also provide full and accurate information for acquiring linguistic knowledge, building language models, and researching information processing technology.

Building a high-quality, large-scale corpus is therefore an essential project, since it underpins the results of all other corpus-based research. So far, no fully satisfactory large-scale Chinese corpus has been produced anywhere in the world. The key problem is corpus quality, and one important measure of a corpus's quality is whether its word segmentation is highly consistent. In building a Chinese corpus, especially a large-scale one, segmentation inconsistencies are inevitable.

To date, research has focused mainly on word segmentation itself, and many methods and algorithms have been proposed. Research on segmentation inconsistency, however, has seldom been reported. To improve the quality of Chinese corpora, this thesis studies segmentation inconsistency and does the following work:

1. Based on statistics and analysis of the segmentation inconsistencies in a Chinese corpus of 1.5 million Chinese characters, the main structural types of segmentation inconsistency are defined.
2. A rule-based classification method is proposed. Nineteen rules for classifying and tagging inconsistent segmentations were induced manually. These rules are effective for 50% of the classification.
3. Statistical methods, namely pointwise mutual information and the t-test, are used for classification, combined with a nearest-neighbor strategy. Through an experiment on data containing 1 million words, a probability value and a feature vector were obtained for classifying and tagging.
4. A test was carried out on the corpus. The results show that the methods proposed in the thesis are effective.

In the course of this work, we found it hard to find a single method that resolves the problem, so a combined method is used for classification. With this method, inconsistent segmentations that have the same structure and similar function are grouped together and can be given a uniform segmentation format. The goal of the thesis is to mark inconsistent segmentations with a class tag and a suggested tag. In the open test, a CP of 76% was obtained.
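
The statistical measures named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the two-element feature vector (PMI, t-score), the Euclidean distance, and the class labels are all assumptions made for illustration. It computes pointwise mutual information and the usual collocation t-score for adjacent token pairs, then assigns a class by a 1-nearest-neighbor lookup.

```python
import math
from collections import Counter


def pair_statistics(tokens):
    """Return {(x, y): (pmi, t_score)} for every adjacent token pair.

    Probabilities are simple relative frequencies (an assumption; the
    thesis does not state how its probabilities were estimated).
    """
    n = len(tokens)
    n_pairs = n - 1
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    stats = {}
    for (x, y), count_xy in bigrams.items():
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        p_xy = count_xy / n_pairs
        # Pointwise mutual information: log2 of observed vs. expected co-occurrence.
        pmi = math.log2(p_xy / (p_x * p_y))
        # Collocation t-score: (observed - expected) / sqrt(observed / N),
        # using the sample proportion as the variance estimate.
        t_score = (p_xy - p_x * p_y) / math.sqrt(p_xy / n_pairs)
        stats[(x, y)] = (pmi, t_score)
    return stats


def nearest_neighbor(vector, labeled_vectors):
    """Return the label of the closest labeled vector (Euclidean distance)."""
    closest = min(labeled_vectors, key=lambda item: math.dist(vector, item[0]))
    return closest[1]


# Toy usage: tokens stand in for segmented units; labels are hypothetical.
tokens = ["a", "b", "a", "b", "c"]
stats = pair_statistics(tokens)
labeled = [((0.0, 0.0), "class_1"), ((2.0, 1.0), "class_2")]
predicted = nearest_neighbor(stats[("a", "b")], labeled)
```

A pair with high PMI and a high t-score co-occurs more often than chance would predict, which is the kind of evidence that could suggest the two units belong together. How the thesis turns these values into its class tags and suggested tags is not described in the abstract.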

Keywords/Search Tags:

Chinese information processing, corpus, word segmentation, inconsistency of segmentation

Related items

1	Cascade Consistency Check Of Segmentation Of The Chinese Corpus
2	Comparative Research On Open-Source Chinese Word Segmentation Machines
3	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
4	The Research And Implementation Of Automatic Chinese Word Segmentation System
5	Research On The Methods Of Automatic Correction Of Chinese Word Segmentation And Part-of-Speech Tagging
6	Research On Chinese Word Segmentation Integrating Pinyin And Tone Information
7	Based On The Statistics Of Open Chinese Word Segmentation
8	Research And Implementation Of Chinese Word Segmentation Algorithm
9	Research And Application Of Chinese Word Segmentation Based On English-Chinese Parallel Corpus
10	Study Of Chinese Word POI Segmentation System Based On N-Shortest-Paths And HMM