With the rapid development of the Internet, the databases of search engines have become larger and larger, and the cost of implementing and maintaining these databases has grown accordingly. Lucene, an open-source toolkit, has been widely used for full-text indexing and retrieval. However, Lucene does not process Chinese text well. Motivated by this, Chinese word segmentation based on Lucene is studied in this thesis. First, Lucene and current Chinese word segmentation methods, including dictionary mechanisms, segmentation algorithms, and ambiguity processing, are briefly introduced. Second, a dictionary structure is designed that consists of a hash table built on the first characters of words and ordered hierarchy tables constructed from the second and subsequent characters, so that words sharing the same first character are stored in memory as a tree structure. Third, texts are segmented by the forward and reverse maximum matching algorithms, and the ambiguous segments are collected and resolved using ambiguity-separation rules and mutual-information principles. Finally, based on the dictionary mechanism and the proposed algorithm, the Chinese word segmentation module is improved, and an analyzer named MyCEAnalyzer that can process both Chinese and English text is developed. Experimental results, evaluated in terms of both dictionary performance and word segmentation performance, show that the tool works well in practice.
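The dictionary structure and the two matching passes described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation (which is built as a Lucene analyzer in Java): the sample dictionary words, the `MAX_WORD_LEN` limit, and all function names are assumptions for the sketch. The hash table keyed by first character, with a nested-dict trie over the remaining characters, mirrors the described tree-structured storage of words sharing a first character.

```python
MAX_WORD_LEN = 4  # assumed upper bound on dictionary word length

def build_dictionary(words):
    """Hash table on the first character; each value is a trie
    (nested dicts) over the second and subsequent characters."""
    table = {}
    for w in words:
        node = table.setdefault(w[0], {})
        for ch in w[1:]:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return table

def lookup(table, word):
    """Walk the first-character hash table, then the trie."""
    node = table.get(word[0])
    if node is None:
        return False
    for ch in word[1:]:
        node = node.get(ch)
        if node is None:
            return False
    return "$" in node

def forward_mm(table, text):
    """Forward maximum matching: take the longest dictionary word
    starting at the current position, else a single character."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or lookup(table, cand):
                result.append(cand)
                i += size
                break
    return result

def reverse_mm(table, text):
    """Reverse maximum matching: same idea, scanning from the end."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_WORD_LEN, j), 0, -1):
            cand = text[j - size:j]
            if size == 1 or lookup(table, cand):
                result.append(cand)
                j -= size
                break
    result.reverse()
    return result

if __name__ == "__main__":
    # Illustrative dictionary; a classic overlapping-ambiguity case.
    d = build_dictionary(["研究", "研究生", "生命", "起源"])
    print(forward_mm(d, "研究生命起源"))  # ['研究生', '命', '起源']
    print(reverse_mm(d, "研究生命起源"))  # ['研究', '生命', '起源']
```

Where the two passes disagree, as in the example above, the disagreement marks exactly the kind of ambiguous segment that the thesis collects and resolves with separation rules and mutual information.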