Research On Overlapping Ambiguity Treatment For Chinese Word Segmentation

Posted on: 2012-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: B C Wei
Full Text: PDF
GTID: 2218330338470694
Subject: Computer application technology
Abstract/Summary:
The word is the smallest linguistic unit that carries meaning independently, and it is the foundation for processing natural language text of every kind. Written Chinese is unusual in that it has no explicit delimiter between words; a sentence is a continuous string of characters. Cutting that character string into a word string, that is, automatically identifying word boundaries, is a key problem that Chinese information processing urgently needs to solve. This paper studies the dictionary mechanism and how to discover and eliminate overlapping (intersection) ambiguity in Chinese word segmentation. The main research work covers the following aspects:
(1) Describe the research background, significance, and current state of Chinese word segmentation, and briefly introduce several representative segmentation systems.
(2) Describe in detail the algorithms used in Chinese word segmentation, with examples that illustrate the idea and operation of each; summarize the difficulties encountered in the segmentation process; and give the evaluation criteria for Chinese word segmentation.
(3) The core indicators of word segmentation are speed and accuracy. After studying several common dictionary structures and weighing the advantages and disadvantages of each, with attention to space efficiency, segmentation speed, and lookup efficiency, this paper adopts a double-character hash index dictionary mechanism: a hash index is built on the first two characters of each word, and the remaining strings are sorted to form the body of the dictionary.
(4) The focus of this paper is the handling of overlapping ambiguity. It first introduces the causes and the classification of ambiguity, then describes the discovery algorithm and the resolution algorithm.
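The dictionary mechanism in item (3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation; the class name `TwoCharHashDict`, its methods, and the sample words are hypothetical. It assumes a Python dict stands in for the hash index on the first two characters, a sorted list of remaining suffixes stands in for the dictionary body, and single-character words are kept in a separate set.

```python
from bisect import bisect_left
from collections import defaultdict


class TwoCharHashDict:
    """Hypothetical sketch: hash index on the first two characters,
    sorted remaining strings as the dictionary body."""

    def __init__(self, words):
        self.single = set()          # one-character words (assumption)
        buckets = defaultdict(set)
        for w in words:
            if len(w) == 1:
                self.single.add(w)
            elif len(w) >= 2:
                buckets[w[:2]].add(w[2:])   # key: first two characters
        # Sort each bucket so suffixes can be found by binary search.
        self.index = {k: sorted(v) for k, v in buckets.items()}

    def contains(self, word):
        if len(word) == 1:
            return word in self.single
        tails = self.index.get(word[:2])
        if tails is None:
            return False
        tail = word[2:]
        i = bisect_left(tails, tail)
        return i < len(tails) and tails[i] == tail

    def longest_match(self, text, start=0):
        """Longest dictionary word beginning at `start`;
        falls back to the single character at that position."""
        key = text[start:start + 2]
        best = text[start:start + 1]
        for tail in self.index.get(key, ()):
            cand = key + tail
            if len(cand) > len(best) and text.startswith(cand, start):
                best = cand
        return best


d = TwoCharHashDict(["中", "中国", "中国人", "人民"])
print(d.contains("中国人"), d.longest_match("中国人民"))
```

One hash probe narrows the search to words sharing the first two characters, and the sorted body allows binary search within that bucket, which is the kind of space and lookup trade-off the paragraph above weighs.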
This paper proposes a method for discovering overlapping ambiguity: in the binary segmentation map, if for an atom lying on the symmetry axis the position connected to its right and the position above it are both non-empty, an overlapping ambiguity exists at that atom. A statistical approach is used to resolve the ambiguity. The paper first describes several common methods and analyzes the advantages and disadvantages of each, then applies a linear superposition of the double-character difference and the t-test. CDT is calculated at each ambiguous position to decide whether to segment there. The experimental results show that the segmentation algorithm based on a combination of dictionary and statistics significantly improves speed and precision compared with the traditional segmentation algorithm. However, the algorithm cannot handle combinational ambiguity or unknown words, which requires further study.
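The discovery rule for overlapping ambiguity can be sketched in Python under one possible reading of the abstract. The assumptions are these: the binary segmentation map is a square grid in which cell (i, j), with i < j, is non-empty when the characters from position i to position j form a dictionary word; each diagonal cell (k, k) is an atom; "to its right" means any non-empty cell in row k; and "above it" means any non-empty cell in column k. The function name, the `max_len` parameter, and the sample dictionary are hypothetical.

```python
def find_overlap_positions(text, dictionary, max_len=4):
    """Return character positions k where a multi-character word ends at k
    and another multi-character word starts at k (assumed reading)."""
    n = len(text)
    # grid[i][j] is True when text[i:j+1] is a dictionary word (i < j).
    grid = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len)):
            if text[i:j + 1] in dictionary:
                grid[i][j] = True

    positions = []
    for k in range(n):
        right = any(grid[k][j] for j in range(k + 1, n))   # row k, right of the atom
        above = any(grid[i][k] for i in range(k))          # column k, above the atom
        if right and above:
            positions.append(k)
    return positions


words = {"结合", "合成", "成分", "分子"}
print(find_overlap_positions("结合成分子", words))
```

A position is reported when its character is shared by a word ending there and a word starting there, which is the defining property of an overlapping ambiguity. Resolving the ambiguity (the CDT step) is not shown, since the abstract does not give its formula.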
Keywords/Search Tags: Chinese word segmentation, Hash Index, Word segmentation algorithm, CDT