
Study On An Improved Chinese Segmentation Algorithm And Its Application In Lucene

Posted on: 2011-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: M Fu
Full Text: PDF
GTID: 2178330338986031
Subject: Software engineering
Abstract/Summary:
Chinese word segmentation is one of the most important components of Chinese information processing. An algorithm that combines character matching with statistical methods can segment Chinese text more effectively. The algorithm first splits the text at punctuation marks, producing short sentences with complete meaning, which improves the accuracy of character matching. Each short sentence is then scanned and segmented using the Maximum Matching Method and the Reverse Minimum Matching Method. During scanning, an optimization routine refines the results according to language rules; it recognizes characters, letters, and numbers, which strengthens the algorithm's ability to handle different types of text. Finally, ambiguity is eliminated using the Minimal Segmentation Principle and a statistical method.

Chinese segmentation algorithms generally fall into three categories: those based on character matching, those based on statistics, and those based on understanding. Each has its own merits. The improved segmentation algorithm combines the ease of implementation and high efficiency of character matching with language rules, and thereby raises the accuracy of basic segmentation. In practice, the two scans use the Maximum Matching Method and the Reverse Minimum Matching Method, exploiting the former's tendency to produce fewer fragments and the latter's particular ability to handle polysemous ambiguity. Characters, letters, and numbers are handled by language rules during scanning. Chinese numerals and classifiers, and Roman numerals in English text, are then processed by the optimization routine, which better addresses the segmentation of mixed-type text. To eliminate ambiguity, the improved algorithm compares the results of the two scans and outputs either one directly when they are equal.
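The scanning stage described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the toy dictionary, the punctuation set, the function names, and the choice to treat reverse minimum matching as "shortest multi-character dictionary word ending at the current position, otherwise a single character" are all assumptions made for the example.

```python
import re

# Hypothetical toy dictionary; a real system would load a full lexicon.
DICTIONARY = {"研究", "研究生", "生命", "起源"}
MAX_WORD_LEN = 3  # length of the longest entry in the toy dictionary

# Assumed set of sentence-breaking punctuation (Chinese and ASCII).
PUNCTUATION = re.compile(r"[，。！？；：、,.!?;:]+")


def split_on_punctuation(text):
    """Split text into short sentences at punctuation marks."""
    return [part for part in PUNCTUATION.split(text) if part]


def forward_maximum_match(sentence, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, start = [], 0
    while start < len(sentence):
        for length in range(min(max_len, len(sentence) - start), 0, -1):
            candidate = sentence[start:start + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                start += length
                break
    return words


def reverse_minimum_match(sentence, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Scan right to left, taking the shortest multi-character dictionary word."""
    words, end = [], len(sentence)
    while end > 0:
        for length in range(2, min(max_len, end) + 1):
            candidate = sentence[end - length:end]
            if candidate in dictionary:
                break
        else:
            # No multi-character word ends here; emit one character.
            candidate = sentence[end - 1:end]
        words.append(candidate)
        end -= len(candidate)
    words.reverse()
    return words
```

With this toy dictionary, the two scans disagree on the sentence 研究生命起源: the forward scan greedily takes 研究生 and leaves 命 stranded, while the reverse scan yields 研究 / 生命 / 起源. A disagreement of this kind is what the ambiguity-elimination step has to settle.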
If the results of the two scans differ, an ambiguity is deemed to have occurred and is resolved as follows: if the numbers of fragments differ, the result with fewer fragments is output, following the Minimal Segmentation Principle; if the numbers of fragments are equal, the result containing the higher-frequency word is output, following the statistical method. Another improvement concerns the structure of the dictionary: the first two characters of each word are stored in a hash table, and the remainder of the word is stored in a linked list ordered by frequency. This improves segmentation efficiency to some extent. The whole algorithm can be applied to Lucene as a component of a Chinese information retrieval system. Experimental results show that the algorithm is considerably more accurate than the segmentation provided by Lucene.
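The dictionary layout and the ambiguity rules above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the thesis's code: `TwoCharIndex` and `resolve` are hypothetical names, a Python list sorted by descending frequency stands in for the frequency-ordered linked list, and "higher-frequency word" is interpreted here as the larger total frequency over a segmentation, which the abstract does not specify.

```python
from collections import defaultdict


class TwoCharIndex:
    """Dictionary keyed by each word's first two characters.

    The remainder of each word is kept in a list sorted by descending
    frequency (standing in for the frequency-ordered linked list).
    Assumption: words shorter than two characters are keyed by the
    whole word, with an empty remainder.
    """

    def __init__(self, word_freqs):
        self._buckets = defaultdict(list)
        for word, freq in word_freqs.items():
            self._buckets[word[:2]].append((word[2:], freq))
        for bucket in self._buckets.values():
            bucket.sort(key=lambda item: item[1], reverse=True)

    def frequency(self, word):
        """Return the stored frequency of word, or 0 if it is unknown."""
        for rest, freq in self._buckets.get(word[:2], ()):
            if rest == word[2:]:
                return freq
        return 0


def resolve(forward, reverse, index):
    """Choose between two segmentations of the same sentence."""
    if forward == reverse:
        return forward  # no ambiguity
    if len(forward) != len(reverse):
        # Minimal Segmentation Principle: prefer fewer fragments.
        return min(forward, reverse, key=len)

    # Same fragment count: fall back to frequency.
    # Assumption: compare the summed frequency of each segmentation.
    def total(words):
        return sum(index.frequency(word) for word in words)

    return forward if total(forward) >= total(reverse) else reverse
```

Given made-up frequencies of 50 for 研究, 40 for 生命, 20 for 研究生, and 10 for 起源, the three-fragment results 研究生 / 命 / 起源 and 研究 / 生命 / 起源 tie on fragment count, so the frequency rule decides and the second segmentation is chosen. Keeping each bucket sorted by frequency means the most common continuations of a two-character prefix are checked first during lookup.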
Keywords/Search Tags: Chinese Segmentation, Two-Pass Scan, Ambiguity Elimination, Hashtable, Lucene