Research On Chinese Segmentation And Unlisted Words Identification For Chinese Information Retrieval

Posted on: 2008-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: C Cheng
Full Text: PDF
GTID: 2178360242465395
Subject: Information Science
Abstract/Summary:
Chinese segmentation is a fundamental task in Chinese information processing. A Chinese segmentation algorithm for Chinese information retrieval should be able to identify both ambiguities and unlisted words. This thesis presents an in-depth study of unlisted word identification. Modeled on the way people recognize new words while reading, it proposes a new unlisted word identification algorithm composed of several rules: a rule for identifying numerals and quantifiers, a rule of border words, a rule of auxiliary empty words, a rule for memory-based identification of unlisted words, and a rule that identifies unlisted words by right or left detection. The algorithm does not rely on a large corpus and can effectively identify various types of unlisted words across multiple fields. In addition, by comparing the results of the bidirectional segmentation algorithm, it identifies the most common crossing ambiguities, so that the identification of unlisted words and of crossing ambiguities is integrated. This effectively solves the problem of new ambiguities emerging during unlisted word identification. The dictionary's organizational structure and the word lookup algorithm were also greatly improved, which enhances the efficiency of segmentation and the capability of the dictionary and makes updating and maintaining the dictionary more flexible. On this basis, the thesis first analyzes the characteristics of information retrieval systems and their requirements for a segmentation algorithm, then presents a self-adaptive segmentation algorithm and develops a module named CarmmLib.dll based on that algorithm, together with Carmm, a self-adaptive Chinese segmentation system for information retrieval.
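The idea of comparing bidirectional segmentation results can be illustrated with a minimal sketch. It assumes plain dictionary-based forward and backward maximum matching, which is a common textbook realization and not necessarily the procedure used in Carmm; the function names and the toy dictionary are hypothetical. When the two directions disagree on a string, the disagreement marks a candidate crossing ambiguity.

```python
def forward_max_match(text, dictionary, max_len):
    """Greedy left-to-right segmentation: take the longest dictionary word
    starting at the current position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Greedy right-to-left segmentation: take the longest dictionary word
    ending at the current position, falling back to a single character."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


def compare_segmentations(text, dictionary):
    """Return both segmentations and whether they disagree.
    A disagreement is treated here as a candidate crossing ambiguity."""
    max_len = max(map(len, dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    return fwd, bwd, fwd != bwd


if __name__ == "__main__":
    # Toy dictionary (an assumption for illustration only).
    toy_dictionary = {"研究", "研究生", "生命", "起源"}
    fwd, bwd, differs = compare_segmentations("研究生命起源", toy_dictionary)
    print(fwd)      # forward result
    print(bwd)      # backward result
    print(differs)  # True when the two directions disagree
```

In this toy case the forward pass groups the first three characters into one word while the backward pass splits them differently, so the string is flagged for further disambiguation. How the flagged span is then resolved is outside the scope of this sketch.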
Carmm is extensible and portable: users can customize the dictionaries and segmentation results and maintain both the common dictionary and the unlisted words dictionary. Finally, Carmm and the free version of ICTCLAS were evaluated and compared in terms of system performance (basic performance, system load, stability), segmentation accuracy, and the accuracy and recall of unlisted word identification. Carmm's segmentation speed is steady at about 100 KB/s. In an open evaluation on the People's Daily Corpus, Carmm's segmentation accuracy is about 91.2%. In an open evaluation on the latest web documents, Carmm's segmentation accuracy of about 90.1% is close to the 91.3% of the free version of ICTCLAS; its unlisted word identification accuracy of 91.2% is slightly lower than the 93.9% of the free version of ICTCLAS; and its unlisted word identification recall of 94.7% is significantly higher than the 89.0% of the free version of ICTCLAS. Carmm also outperforms the free version of ICTCLAS in segmentation speed, speed stability when identifying a large number of unlisted words, robustness under high system load, ease of use, and resistance to interference.
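The accuracy and recall figures above can be read with the following minimal sketch. It assumes that "accuracy rate" means precision (correct items divided by items the system produced) and that "recall rate" means correct items divided by items in the reference; the abstract does not state the exact definitions used in the evaluation, and the helper names are hypothetical.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans,
    so that two segmentations of the same text can be compared."""
    spans, start = set(), 0
    for token in tokens:
        spans.add((start, start + len(token)))
        start += len(token)
    return spans


def precision_recall(predicted, reference):
    """Precision = correct / predicted, recall = correct / reference.
    Both arguments are collections of hashable items (spans or words)."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Segmentation scored on word spans (toy data, for illustration only).
    system = to_spans(["研究生", "命", "起源"])
    gold = to_spans(["研究", "生命", "起源"])
    print(precision_recall(system, gold))

    # Unlisted word identification scored on the sets of words found.
    found = {"w1", "w2", "w3", "w4"}
    expected = {"w1", "w2", "w3", "w5", "w6"}
    print(precision_recall(found, expected))
```

Under these definitions, a system can trade one measure for the other: reporting more candidate unlisted words tends to raise recall while lowering precision, which matches the pattern in the comparison between Carmm and the free version of ICTCLAS.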
Keywords/Search Tags: Information Retrieval, Chinese Segmentation, Unlisted Words Identification, Chinese Segmentation System