The Study Of Maximum-Match-Based Written Chinese Automatic Segmentation

Posted on: 2005-06-05  Degree: Master  Type: Thesis
Country: China  Candidate: C Yang  Full Text: PDF
GTID: 2168360155962527  Subject: Computer application technology
Abstract/Summary:
Automatic segmentation of written Chinese is an important step in Chinese information processing. It underlies many applications of Chinese information processing, such as text checking, machine translation, text classification, text retrieval, and human-computer interfaces. At present, three main approaches are used for Chinese word segmentation: character matching, statistical methods, and understanding-based methods. After analyzing existing Chinese word segmentation algorithms, this thesis concentrates on character matching: it first segments text with the maximum match method, then applies statistical methods to ambiguous segmentations and to the recognition of unknown words. Because two-word words (words of two characters) are so common in Chinese, we propose an improved dictionary mechanism that adds a two-word bitmap to the data structure of the dictionary. On this basis we improve the maximum match method and implement a version based on the two-word bitmap, which uses the bitmap to recognize two-word words quickly and reduces the number of dictionary lookups, thereby increasing the speed of automatic word segmentation. We find that the pseudo-ambiguity type among the high-frequency maximal crossing ambiguities has strong coverage and is fairly stable under domain shift. We therefore propose obtaining the high-frequency maximal crossing ambiguities automatically, adding their correct segmentations to an ambiguity database, and resolving ambiguities by matching against that database directly, which is in essence a memory-based strategy.
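The abstract does not give the layout of the two-word bitmap or the exact matching procedure, so the following is only a minimal sketch of one plausible reading: forward maximum matching in which two-character candidates are tested against a bit array instead of the main dictionary. The names `TwoCharBitmap` and `segment`, the `max_len` parameter, and the dense character-index layout are all assumptions made for illustration, not the thesis's implementation.

```python
class TwoCharBitmap:
    """Hypothetical bitmap: one bit per (first char, second char) pair."""

    def __init__(self, words):
        # Index only the characters that occur in two-character words.
        chars = sorted({c for w in words if len(w) == 2 for c in w})
        self.index = {c: k for k, c in enumerate(chars)}
        self.n = len(chars)
        self.bits = bytearray((self.n * self.n + 7) // 8)
        for w in words:
            if len(w) == 2:
                pos = self.index[w[0]] * self.n + self.index[w[1]]
                self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, pair):
        if len(pair) != 2:
            return False
        a = self.index.get(pair[0])
        b = self.index.get(pair[1])
        if a is None or b is None:
            return False
        pos = a * self.n + b
        return bool(self.bits[pos >> 3] & (1 << (pos & 7)))


def segment(text, long_words, bitmap, max_len=4):
    """Forward maximum match; two-character candidates use the bitmap."""
    out = []
    i = 0
    while i < len(text):
        match = None
        # Longest candidates first: lengths max_len down to 3 hit the dictionary.
        for n in range(min(max_len, len(text) - i), 2, -1):
            if text[i:i + n] in long_words:
                match = text[i:i + n]
                break
        # Two-character candidates are answered by a single bit test.
        if match is None and text[i:i + 2] in bitmap:
            match = text[i:i + 2]
        # Fall back to a single character.
        if match is None:
            match = text[i]
        out.append(match)
        i += len(match)
    return out


words = {"研究", "研究生", "生命", "起源"}
bitmap = TwoCharBitmap(words)
long_words = {w for w in words if len(w) > 2}
print(segment("研究生命起源", long_words, bitmap, max_len=3))
```

With this toy dictionary the greedy match takes "研究生" first and leaves "命" stranded, which is the kind of crossing ambiguity the ambiguity database described above is meant to resolve.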
The study of unknown-word recognition focuses on obtaining unknown words from Web resources and proposes an algorithm based on Web query logs, which analyzes query word frequency to recognize unknown words. Building on the research described above, we design and implement a written Chinese automatic segmentation system aimed at practical applications. The experimental results show that, under the same conditions, the improved maximum match algorithm based on the two-word bitmap segments faster than the original algorithm. When the system is tested with the Chinese Word Segmentation Evaluation Toolkit of Carnegie Mellon University, the returned data show that the...
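The query-log approach to unknown words is described only in outline, so the sketch below illustrates the general idea rather than the thesis's algorithm: count how often each query string appears in a log and treat frequent strings that are missing from the dictionary as candidate new words. The function name `candidate_new_words`, the one-query-per-entry log format, and the `min_count` and `max_len` thresholds are hypothetical.

```python
from collections import Counter


def candidate_new_words(queries, dictionary, min_count=3, max_len=4):
    """Return frequent query strings absent from the dictionary.

    Assumes each element of `queries` is one logged query string;
    `min_count` and `max_len` are illustrative thresholds.
    """
    counts = Counter(q.strip() for q in queries)
    candidates = [
        q for q, c in counts.items()
        if c >= min_count and 1 < len(q) <= max_len and q not in dictionary
    ]
    # Most frequent first; ties broken by the string itself for stable output.
    return sorted(candidates, key=lambda q: (-counts[q], q))


log = ["博客"] * 3 + ["计算机"] * 5 + ["微博"] * 2 + ["博客 "]
print(candidate_new_words(log, dictionary={"计算机"}))
```

A real system would also need to filter multi-word queries and noise, which this sketch does not attempt.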
Keywords/Search Tags:Chinese Word Segmentation, Maximum Match, Two-Word Words, Ambiguities Segmentation, Pseudo Type of Ambiguities, Unknown Word Recognition, Precision