With the rapid development of the Internet, the databases of search engines have become larger and larger, and the cost of implementing and maintaining these databases has grown accordingly. Lucene, an open-source toolkit, has been widely used for full-text indexing and retrieval. However, Lucene does not process Chinese text well. Motivated by this, Chinese word segmentation based on Lucene is studied in this thesis. First, Lucene and current Chinese word segmentation methods, including dictionary mechanisms, segmentation algorithms, and ambiguity processing, are briefly introduced. Second, a dictionary structure is designed that consists of a hash table built on the first characters of words and ordered hierarchy tables constructed from the second and subsequent characters, so that words sharing the same first character are stored in memory as a tree structure. Third, texts are segmented by the forward and reverse maximum matching algorithms, and the ambiguous segments are collected and resolved using ambiguity-separation rules and mutual-information principles. Finally, based on the dictionary mechanism and the proposed algorithm, the Chinese word segmentation module is improved, and an analyzer named MyCEAnalyzer that can process both Chinese and English text is developed. Experimental results, evaluated in terms of both dictionary performance and word segmentation performance, show that the tool works well in practice.
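The dictionary structure and the two matching passes described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation (which is built as a Lucene analyzer in Java): the sample dictionary words, the `MAX_WORD_LEN` limit, and all function names are assumptions for the sketch. The hash table keyed by first character, with a nested-dict trie over the remaining characters, mirrors the described tree-structured storage of words sharing a first character.

```python
MAX_WORD_LEN = 4  # assumed upper bound on dictionary word length

def build_dictionary(words):
    """Hash table on the first character; each value is a trie
    (nested dicts) over the second and subsequent characters."""
    table = {}
    for w in words:
        node = table.setdefault(w[0], {})
        for ch in w[1:]:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return table

def lookup(table, word):
    """Walk the first-character hash table, then the trie."""
    node = table.get(word[0])
    if node is None:
        return False
    for ch in word[1:]:
        node = node.get(ch)
        if node is None:
            return False
    return "$" in node

def forward_mm(table, text):
    """Forward maximum matching: take the longest dictionary word
    starting at the current position, else a single character."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or lookup(table, cand):
                result.append(cand)
                i += size
                break
    return result

def reverse_mm(table, text):
    """Reverse maximum matching: same idea, scanning from the end."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_WORD_LEN, j), 0, -1):
            cand = text[j - size:j]
            if size == 1 or lookup(table, cand):
                result.append(cand)
                j -= size
                break
    result.reverse()
    return result

if __name__ == "__main__":
    # Illustrative dictionary; a classic overlapping-ambiguity case.
    d = build_dictionary(["研究", "研究生", "生命", "起源"])
    print(forward_mm(d, "研究生命起源"))  # ['研究生', '命', '起源']
    print(reverse_mm(d, "研究生命起源"))  # ['研究', '生命', '起源']
```

Where the two passes disagree, as in the example above, the disagreement marks exactly the kind of ambiguous segment that the thesis collects and resolves with separation rules and mutual information.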