Chinese Word Segmentation Technology Research Based On Lucene

Posted on: 2013-08-09    Degree: Master    Type: Thesis
Country: China    Candidate: X X Shao    Full Text: PDF
GTID: 2248330395455519    Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the databases behind search engines are becoming larger and larger, and the cost of implementing and maintaining them is rising accordingly. Lucene, an open source toolkit, is widely used for full-text indexing and retrieval. However, Lucene does not process Chinese text well. With this motivation, this thesis studies Chinese word segmentation based on Lucene. First, Lucene and current Chinese word segmentation methods, including dictionary mechanisms, segmentation algorithms, and ambiguity processing, are briefly introduced. Second, a dictionary structure is designed that consists of a hash table keyed by the first character of each word and ordered hierarchical tables built from the second and subsequent characters, so that words sharing the same first character are stored in memory as a tree. Third, texts are segmented by the forward and reverse maximum literal matching algorithms, and the ambiguous segments are collected and resolved using ambiguity separation rules and the mutual information principle. Finally, based on the dictionary mechanism and the proposed algorithm, the Chinese word segmentation module is improved, and an analyzer named MyCEAnalyzer that can process both Chinese and English words is developed. Experimental results, covering both dictionary performance and word segmentation performance, show that the tool works well in practice.
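The dictionary layout and the two matching passes described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the names `Dictionary`, `forward_max_match`, `backward_max_match`, and `is_ambiguous` are hypothetical, nested dicts stand in for the ordered hierarchical tables, and the sample vocabulary is invented for illustration. Ambiguity is only detected here (the two passes disagree); the separation rules and mutual information step are not shown.

```python
class Dictionary:
    """Hash table keyed by first character; the remaining characters
    of each word are stored as a tree of nested dicts (an assumption,
    standing in for the ordered hierarchical tables)."""

    END = ""  # marker key: a word ends at this node

    def __init__(self, words=()):
        self.heads = {}   # first character -> subtree
        self.max_len = 0  # length of the longest word stored
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.heads.setdefault(word[0], {})
        for ch in word[1:]:
            node = node.setdefault(ch, {})
        node[self.END] = True
        self.max_len = max(self.max_len, len(word))

    def __contains__(self, word):
        if not word:
            return False
        node = self.heads.get(word[0])
        if node is None:
            return False
        for ch in word[1:]:
            node = node.get(ch)
            if node is None:
                return False
        return self.END in node


def forward_max_match(text, dictionary):
    """Scan left to right, always taking the longest dictionary word;
    fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        limit = min(max(dictionary.max_len, 1), len(text) - i)
        for size in range(limit, 0, -1):
            if size == 1 or text[i:i + size] in dictionary:
                tokens.append(text[i:i + size])
                i += size
                break
    return tokens


def backward_max_match(text, dictionary):
    """Same idea, scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        limit = min(max(dictionary.max_len, 1), j)
        for size in range(limit, 0, -1):
            if size == 1 or text[j - size:j] in dictionary:
                tokens.append(text[j - size:j])
                j -= size
                break
    tokens.reverse()
    return tokens


def is_ambiguous(text, dictionary):
    """Flag text on which the two passes disagree."""
    return forward_max_match(text, dictionary) != backward_max_match(text, dictionary)


# Illustrative vocabulary (invented for this sketch).
vocab = Dictionary(["研究", "研究生", "生命", "命", "起源"])
forward = forward_max_match("研究生命起源", vocab)
backward = backward_max_match("研究生命起源", vocab)
```

On this sample the forward pass yields 研究生 / 命 / 起源 while the reverse pass yields 研究 / 生命 / 起源, so the sentence would be flagged as ambiguous and passed on to a disambiguation step.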
Keywords/Search Tags: Chinese word segmentation, Lucene, literal matching algorithm, ambiguity process