
Applied Research on Lucene Chinese Word Segmentation in a Research Document Full-text Retrieval System

Posted on: 2012-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: X L Yu
GTID: 2218330371951809
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of information resources, finding the required information in a huge volume of data has attracted increasing attention, and full-text retrieval is the main technology for solving this problem. Lucene is an open-source full-text retrieval component that can readily be extended through secondary development to build a full-text retrieval system. In practice, however, many aspects still need improvement, particularly its handling of Chinese word segmentation. The quality of Chinese word segmentation directly affects users' satisfaction with search results, so the Chinese tokenizer is the main subject of this thesis.

First, this thesis describes the technology behind Lucene full-text retrieval and the existing Chinese word segmentation methods, analyzes the shortcomings of Lucene's two analyzers, ChineseAnalyzer and CJKAnalyzer, and proposes a bidirectional (two-way) maximum matching word segmentation algorithm. It also analyzes Lucene's limitations regarding document formats and proposes a general text parsing framework.

The main task of this thesis is the design and implementation of a Lucene-based full-text retrieval system for research documents. The thesis presents the system's framework and functional modules, together with its overall and detailed design. To handle the diversity of document formats, it builds a text parsing module that can parse various document formats. The system's Chinese analyzer is implemented with the improved Chinese word segmentation algorithm. The text parsing module, the Chinese analyzer, and the system's performance are then evaluated. The experimental results show that the Chinese analyzer is markedly effective and that the system's recall and precision both reach a level that satisfies users.

Finally, the Lucene-based research document full-text retrieval system is analyzed, the achievements of this thesis are summarized, and directions for future work are outlined.
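
The abstract names a two-way maximum matching segmentation algorithm but gives no details, so the following is only a minimal sketch of the textbook variant, not the thesis's improved algorithm. It runs a forward and a backward longest-match pass over a dictionary and then chooses between the two results. The function names, the toy dictionary, and the tie-breaking rules (fewer tokens, then fewer single-character tokens, then the backward result) are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:
                tokens.append(word)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            word = text[j - size:j]
            if size == 1 or word in dictionary:
                tokens.append(word)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


def count_single_chars(tokens):
    return sum(1 for token in tokens if len(token) == 1)


def bidirectional_max_match(text, dictionary):
    """Run both passes and keep the result preferred by the assumed heuristics."""
    max_len = max((len(word) for word in dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    # Assumed heuristic 1: prefer the segmentation with fewer tokens.
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Assumed heuristic 2: prefer fewer single-character tokens; ties go to backward.
    return fwd if count_single_chars(fwd) < count_single_chars(bwd) else bwd


# Toy dictionary, for illustration only.
words = {"研究", "研究生", "生命", "的", "起源"}
print(bidirectional_max_match("研究生命的起源", words))
```

On this input the forward pass greedily takes 研究生 and leaves 命 stranded as a single character, while the backward pass recovers 研究 / 生命, which is why combining the two directions can resolve ambiguities that either pass alone gets wrong.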
Keywords/Search Tags: Full-text Retrieval, Lucene, Chinese Word Segmentation, Text Parsing