Research And Implementation Of Chinese Word Segmentation Technology Based On Lucene

Posted on: 2015-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: P Wang
Full Text: PDF
GTID: 2268330428961409
Subject: Signal and Information Processing
Abstract/Summary:
The rapid development of information technology has enriched the information available on the Internet and accelerated the development of search engines. As a component of information retrieval, Chinese word segmentation is very important and has promoted the development of full-text retrieval technology. Lucene emerged and matured during this process. Lucene, an open-source toolkit developed by Apache, is designed to implement full-text retrieval. However, specific implementations built on it need improvement, especially for Chinese information processing. Chinese word segmentation in Lucene is therefore the focus of this paper.

Based on in-depth research on Lucene, this paper presents an improved string-matching algorithm, the "Matching algorithm mostly in the word", which segments words more accurately. MyChAnalzyer is then built; its core module is the Chinese word segmentation component, which uses a hash word matching algorithm based on the prefix of a word. Segmentation performance is tested with various methods, mainly in terms of speed and accuracy. The experimental results show that the precision of this analyzer is better than that of the analyzer included with Lucene. The last part of this paper proposes an improved result-sorting algorithm, which combines Lucene's result-sorting algorithm with the PageRank algorithm, and shows the superiority of the improved algorithm through a user experiment measuring the average satisfaction with each algorithm.

Finally, we summarize the paper and put forward future work.
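As a rough illustration of dictionary-based segmentation with a prefix-keyed hash, the following minimal sketch groups dictionary words by their first character and then applies forward longest matching. It is not the algorithm from the thesis: the function names, the toy dictionary, and the choice of forward longest matching are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_prefix_index(words):
    """Hash dictionary words by their first character (hypothetical helper)."""
    index = defaultdict(set)
    for word in words:
        if word:
            index[word[0]].add(word)
    return index


def segment(text, index):
    """Forward longest matching: at each position, take the longest
    dictionary word starting there, or a single character if none matches."""
    tokens = []
    i = 0
    while i < len(text):
        best = text[i]  # fallback: emit the single character
        # Only words sharing the current first character are candidates.
        for word in index.get(text[i], ()):
            if len(word) > len(best) and text.startswith(word, i):
                best = word
        tokens.append(best)
        i += len(best)
    return tokens


# Toy dictionary, assumed for illustration only.
dictionary = ["中文", "分词", "技术", "研究", "中文分词"]
index = build_prefix_index(dictionary)
tokens = segment("中文分词技术研究", index)
print(tokens)
```

Keying the dictionary on the first character means each position in the text is compared only against words that could start there, rather than against the whole dictionary.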
Keywords/Search Tags: Chinese Word Segmentation, Lucene, Hash, Sorting algorithm