
Design And Implementation Of Chinese Word Analyzer Based On Lucene

Posted on: 2015-12-01 | Degree: Master | Type: Thesis
Country: China | Candidate: Y T Wang | Full Text: PDF
GTID: 2348330518470362 | Subject: Signal and Information Processing
Abstract/Summary:
With the arrival of the digital information age, the index databases of search engines have grown larger and larger, and their development and maintenance costs keep rising. The open-source full-text search toolkit Lucene, an excellent full-text retrieval framework, is being used ever more widely. However, its lack of Chinese information processing capability has seriously hampered its effective application in Chinese search engine projects. For this reason, this paper designs and implements a Lucene-based Chinese word analyzer, named MySameAnalyzer, which adds synonym support and improves Lucene's Chinese information processing capability.

First, after analyzing and comparing existing Chinese word segmentation algorithms, this paper concludes that, for search engines, dictionary-based segmentation is currently the best solution. It therefore proposes and designs an optimized dictionary mechanism based on the Trie index tree, the Map-Array combined Trie index tree dictionary mechanism, referred to as the MACTIT dictionary mechanism. Experimental results show that, compared with the traditional Trie index tree dictionary mechanism, MACTIT greatly reduces both time and space overhead, which speeds up segmentation and saves storage space for the segmentation dictionary.

Second, because Lucene searches over an inverted index, its speed advantage comes from whole-word matching rather than substring ("like") matching. Segmentation output that is too coarse-grained, even if it looks reasonable, therefore often prevents users from finding what they are looking for. To solve this problem, this paper adopts a "most fine-grained forward iteration" segmentation algorithm that analyzes the input text at the finest granularity, ensuring the recall rate of the retrieval system during segmentation and search.

Third, ambiguous fields are widespread in word segmentation, the most common being crossing ambiguity. To handle the crossing ambiguities frequently encountered during search, this paper stores the intersecting chunks produced during segmentation in a TreeSet and, using the TreeSet's custom ordering together with six disambiguation rules, greedily selects a non-intersecting, approximately optimal segmentation result to output. Because this disambiguation algorithm does not traverse every candidate segmentation stored in the TreeSet, it saves segmentation time. Experiments show that, compared with several other existing Chinese analyzers that implement the Lucene analyzer interface, the segmentation algorithm designed in this paper achieves the best overall combination of segmentation speed and accuracy.
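To make the dictionary mechanism and the most fine-grained forward matching described above more concrete, the following is a minimal Java sketch. It assumes one plausible reading of a "Map-Array combined" Trie: the first level is a HashMap keyed by the first character of each word, while deeper levels keep their children in sorted arrays searched by binary search. The names (MapArrayTrieDictionary, Node, addWord, matchAll) are illustrative and are not taken from the thesis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of a Map-Array combined Trie dictionary:
 * a HashMap over the (wide) first level for O(1) access, and compact
 * sorted arrays with binary search at the (sparse) deeper levels.
 */
public class MapArrayTrieDictionary {

    /** A trie node whose children are kept in two parallel sorted arrays. */
    private static final class Node {
        char[] childChars = new char[0];  // sorted characters of the children
        Node[] childNodes = new Node[0];  // child nodes, aligned with childChars
        boolean isWord;                   // true if the path to this node is a word

        Node child(char c) {
            int i = Arrays.binarySearch(childChars, c);
            return i >= 0 ? childNodes[i] : null;
        }

        Node addChild(char c) {
            int i = Arrays.binarySearch(childChars, c);
            if (i >= 0) return childNodes[i];
            int pos = -i - 1;  // insertion point that keeps the array sorted
            char[] newChars = new char[childChars.length + 1];
            Node[] newNodes = new Node[childNodes.length + 1];
            System.arraycopy(childChars, 0, newChars, 0, pos);
            System.arraycopy(childNodes, 0, newNodes, 0, pos);
            newChars[pos] = c;
            newNodes[pos] = new Node();
            System.arraycopy(childChars, pos, newChars, pos + 1, childChars.length - pos);
            System.arraycopy(childNodes, pos, newNodes, pos + 1, childNodes.length - pos);
            childChars = newChars;
            childNodes = newNodes;
            return newNodes[pos];
        }
    }

    /** First level: HashMap keyed by the first character of each word. */
    private final Map<Character, Node> roots = new HashMap<>();

    public void addWord(String word) {
        if (word == null || word.isEmpty()) return;
        Node node = roots.computeIfAbsent(word.charAt(0), c -> new Node());
        for (int i = 1; i < word.length(); i++) {
            node = node.addChild(word.charAt(i));
        }
        node.isWord = true;
    }

    /**
     * Fine-grained forward matching: return every dictionary word that
     * starts at position {@code start} of {@code text}.
     */
    public List<String> matchAll(String text, int start) {
        List<String> hits = new ArrayList<>();
        Node node = roots.get(text.charAt(start));
        if (node == null) return hits;
        if (node.isWord) hits.add(text.substring(start, start + 1));
        for (int i = start + 1; i < text.length() && node != null; i++) {
            node = node.child(text.charAt(i));
            if (node != null && node.isWord) hits.add(text.substring(start, i + 1));
        }
        return hits;
    }
}
```

Keeping the wide first level in a hash map while storing the sparser deeper levels as compact arrays is one way such a combination could trade lookup speed against memory; the thesis's actual layout and tuning may differ.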
Finally, MySameAnalyzer, as designed and implemented in this paper, takes full account of how users actually search, for example for product names, personal names, and addresses. Accordingly, three sub-analyzers were designed for MySameAnalyzer, strengthening the handling of English words, numbers, mixed English-and-number strings, and Chinese numerals with Chinese measure words during Chinese word segmentation. The analyzer also supports loading and updating a user-extended dictionary and a user-extended stop-word dictionary, as well as user-defined synonym expansion. Experimental tests show that these extended functions of MySameAnalyzer achieve good segmentation results in practical applications, providing flexible and reliable Chinese-language support for Lucene and effectively improving the recall and precision of Lucene-based Chinese full-text retrieval systems.
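As one way to picture how user-defined stop words and synonym expansion can be wired into a Lucene analysis chain, here is a minimal sketch against a recent Lucene release (7.x or later). SynonymAwareAnalyzer is an illustrative name, StandardTokenizer merely stands in for the thesis's own segmenter, and the sample entries are arbitrary; none of this is MySameAnalyzer's actual code.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

/**
 * Sketch of a Lucene Analyzer that chains a tokenizer with a user-supplied
 * stop-word set and a synonym filter. StandardTokenizer (which splits CJK
 * text into single characters) is only a placeholder for a real segmenter.
 */
public class SynonymAwareAnalyzer extends Analyzer {

    private final CharArraySet stopWords;
    private final SynonymMap synonyms;

    public SynonymAwareAnalyzer(CharArraySet stopWords, SynonymMap synonyms) {
        this.stopWords = stopWords;
        this.synonyms = synonyms;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new StopFilter(source, stopWords);        // drop user stop words
        stream = new SynonymGraphFilter(stream, synonyms, true);       // expand synonyms
        return new TokenStreamComponents(source, stream);
    }

    /** Example: build a tiny synonym map and stop-word set from user-defined entries. */
    public static SynonymAwareAnalyzer example() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        // Map "电脑" to "计算机" while keeping the original token as well.
        builder.add(new CharsRef("电脑"), new CharsRef("计算机"), true);
        CharArraySet stops = new CharArraySet(Arrays.asList("的", "了"), true);
        return new SynonymAwareAnalyzer(stops, builder.build());
    }
}
```

In practice a reloadable user dictionary would be read from a file and the analyzer rebuilt (or its SynonymMap swapped) when the file changes; synonym expansion is also often applied on the query side rather than at index time.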
Keywords/Search Tags: Chinese word segmentation, dictionary mechanism, most fine-grained forward iteration, ambiguity processing, Lucene