The Research On A Lucene-based Full-text Retrieval Model

Posted on:2008-11-15

Degree:Master

Type:Thesis

Country:China

Candidate:J Huang

Full Text:PDF

GTID:2178360215495645

Subject:Computer software and theory

Abstract/Summary:

Full-text indexing and retrieval is an efficient way to find specific information in a large and complex collection of data. Lucene, a project of the Apache Jakarta open-source community, is a toolkit that application systems can readily embed to index and retrieve information. The core and extended libraries in Lucene segment Chinese text automatically, in the same way that they segment English text. However, because the two languages differ (written Chinese does not separate words with spaces), the resulting tokens are coarse and efficiency is poor. Based on a detailed study of the full-text retrieval approach used by the Lucene core and of its word segmentation and analysis principles, this paper presents a Chinese word segmentation module that is dictionary-based and uses the forward maximum matching algorithm. The module is then implemented and tested. Experimental results show that it is more effective and efficient than both the single-character segmentation used in the Lucene core library and the bigram segmentation used in the Lucene extended library for CJK (Chinese, Japanese and Korean) languages. Furthermore, so that it can be embedded in application systems easily and seamlessly, the Lucene core is designed to be compact and handles plain-text data only. However, more and more electronic information is now stored in formatted documents rather than plain-text files. To solve this problem, the full-text retrieval system model proposed in this paper introduces interfaces. With the help of dynamic instantiation, it can effectively process documents in various formats such as txt, xml, html, pdf, doc and rtf.
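
The combination of a parser interface and dynamic instantiation can be illustrated with a minimal Java sketch. All names here (`DocumentParser`, `PlainTextParser`, `HtmlParser`, `ParserFactory`) are hypothetical and are not taken from the thesis or from Lucene; the HTML handling is a crude placeholder, and the sketch only assumes that each format is mapped to a class name that is resolved at run time.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical contract: turn a formatted document into plain text for indexing. */
interface DocumentParser {
    String extractText(String rawContent);
}

/** Plain text needs no conversion. */
final class PlainTextParser implements DocumentParser {
    @Override
    public String extractText(String rawContent) {
        return rawContent;
    }
}

/** Illustration only: strips tags with a regex; a real system would use an HTML parser. */
final class HtmlParser implements DocumentParser {
    @Override
    public String extractText(String rawContent) {
        return rawContent.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}

/** Maps file extensions to parser class names and instantiates them on demand. */
final class ParserFactory {
    private final Map<String, String> classNameByExtension = new HashMap<>();

    void register(String extension, String className) {
        classNameByExtension.put(extension.toLowerCase(Locale.ROOT), className);
    }

    DocumentParser forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String className = classNameByExtension.get(ext);
        if (className == null) {
            throw new IllegalArgumentException("No parser registered for: " + fileName);
        }
        try {
            // Dynamic instantiation: the class is looked up by name at run time,
            // so new formats can be added without changing this factory.
            return (DocumentParser) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate parser: " + className, e);
        }
    }
}
```

Under these assumptions, supporting another format means writing one more `DocumentParser` implementation and registering its class name, for example `factory.register("html", HtmlParser.class.getName())`; the indexing code only ever sees the interface.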
This interface-based approach not only hides the differences between formats but also makes the system more extensible. After a brief review of the system design and implementation, this paper discusses several important problems whose study could guide further improvement, such as the accuracy and recall of Chinese word segmentation, the processing of retrieval results, the implementation of a query interface and the strategy for index updating. Concluding remarks are given at the end of the paper.
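
The dictionary-based forward maximum matching idea named in the abstract can be sketched as follows. This is a generic illustration of the technique, not the module implemented in the thesis: the class name `ForwardMaxMatch` is hypothetical, the dictionary is assumed to be a simple in-memory set, and integration with Lucene's analyzer classes is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Minimal forward maximum matching segmenter over an in-memory dictionary. */
final class ForwardMaxMatch {
    private final Set<String> dictionary;
    private final int maxWordLength;

    ForwardMaxMatch(Set<String> dictionary) {
        this.dictionary = dictionary;
        // The longest dictionary entry bounds the size of the matching window.
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Start with the longest possible candidate at the current position.
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word
            // or only a single character is left.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

With a dictionary containing 全文, 检索, 全文检索 and 系统, segmenting 全文检索系统 yields the two tokens 全文检索 and 系统, whereas single-character segmentation would produce six tokens and bigram segmentation five overlapping ones. Characters that match no dictionary entry fall back to single-character tokens.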

Keywords/Search Tags:

full-text retrieval, Chinese word segmentation, formatted documents, Lucene

Related items

1	Lucene Chinese Word Segmentation Applied Research, Research Document Full-text Retrieval System
2	Research Of Search Engine Key Technique And Optimize Performance
3	Full-text Search For The Modern Chinese Text Processing, Automatic Word Generic System
4	Application Study Of Lucene Full-text Retrieval On The Network Education Platform
5	Development And Maintenance Of Full-text Retrieval Web System Based On Lucene
6	Research And Application Of Full-text Retrieval Technology Based On Lucene
7	Full-text Retrieval Of Distributed Geological Survey Data Based On Lucene
8	Research And Application Of Lucene Full-text Retrieval Technology In Patent Information Service Platform
9	The Research And Implementation Of Full-text Retrieval System Based On Lucene
10	Design And Improvement Of Website Full-text Retrieval System Based On Lucene