The Research On A Lucene-based Full-text Retrieval Model

Posted on: 2008-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Huang
GTID: 2178360215495645
Subject: Computer software and theory
Abstract/Summary:
Full-text indexing and retrieval is an efficient way to find specific information in a large and complex collection of data. Lucene, a project of the open-source Apache Jakarta organization, is a toolkit that can be easily embedded in application systems to provide indexing and retrieval. The core and extended libraries in Lucene segment Chinese text automatically, in the same way that they segment English words. However, because the two languages differ, the results are coarse and the efficiency is poor. Based on a detailed study of the full-text retrieval approach in the Lucene core, including how it segments Chinese text and the principles of its analyzers, this paper presents a Chinese word segmentation module that is based on a word dictionary and uses the forward maximum matching algorithm. The module is then implemented and tested. Experimental results show that it is more effective and efficient than both the single-character segmentation used in the Lucene core library and the two-character (bigram) segmentation used in the Lucene extended library for CJK (Chinese, Japanese and Korean) languages. Furthermore, so that it can be embedded in application systems easily and seamlessly, the Lucene core is designed to be compact and to handle plain-text data only. However, more and more electronic information is now stored in formatted documents rather than plain-text files. To solve this problem, interfaces are introduced in the full-text retrieval system model proposed in this paper. With the help of dynamic instantiation, the system can process documents in various formats, such as txt, xml, html, pdf, doc and rtf.
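As an illustration of the interface-plus-dynamic-instantiation idea, the following is a minimal sketch in Python and not the thesis's actual design. The names `TextExtractor`, `register`, `extractor_for`, `PlainTextExtractor` and `HtmlExtractor` are hypothetical; the abstract does not give the real interfaces. Only txt and html are covered here, and the HTML handler simply keeps every text node, including any script or style content.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
import os


class TextExtractor(ABC):
    """Hypothetical interface: turn one formatted document into plain text."""

    @abstractmethod
    def extract(self, raw: bytes) -> str:
        ...


# Registry from file extension to extractor class (illustrative only).
_EXTRACTORS = {}


def register(*extensions):
    """Class decorator that records an extractor class for the given extensions."""
    def wrap(cls):
        for ext in extensions:
            _EXTRACTORS[ext.lower()] = cls
        return cls
    return wrap


@register("txt")
class PlainTextExtractor(TextExtractor):
    def extract(self, raw: bytes) -> str:
        return raw.decode("utf-8", errors="replace")


class _TagStripper(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


@register("html", "htm")
class HtmlExtractor(TextExtractor):
    def extract(self, raw: bytes) -> str:
        parser = _TagStripper()
        parser.feed(raw.decode("utf-8", errors="replace"))
        parser.close()
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join(" ".join(parser.chunks).split())


def extractor_for(filename: str) -> TextExtractor:
    """Pick the extractor class by extension at run time and instantiate it."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    try:
        cls = _EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"no extractor registered for .{ext}") from None
    return cls()


text = extractor_for("page.html").extract(b"<p>Hello <b>world</b></p>")
```

Under this sketch, supporting a further format such as pdf or rtf would mean adding one more registered class; the indexing code that calls `extractor_for` would not change.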
This approach not only hides the differences between formats but also makes the system more extensible. After a brief review of the system design and implementation, this paper discusses several important problems whose resolution could lead to further improvement, such as the accuracy and recall of Chinese word segmentation, the processing of retrieval results, the implementation of a query interface, and the strategy for updating the index. Concluding remarks are given at the end of the paper.
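The dictionary-based forward maximum matching segmentation named in the abstract can be sketched roughly as follows. This is a generic illustration of the technique in Python, not the module built in the thesis. The function name `forward_max_match` and the tiny dictionary are assumptions made for the example, and non-Chinese text and punctuation are not treated specially.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment text left to right, always taking the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character
    tokens so that the scan always advances.
    """
    if max_word_len is None:
        # Longest dictionary entry bounds the window; never smaller than 1.
        max_word_len = max(1, max((len(w) for w in dictionary), default=1))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Illustrative dictionary; a real word library would be far larger.
words = {"中国", "中国人", "人民", "喜欢", "全文", "检索", "全文检索"}
tokens = forward_max_match("中国人民喜欢全文检索", words)
print(tokens)  # ['中国人', '民', '喜欢', '全文检索']
```

The example also shows why segmentation accuracy remains an open problem: because the match is greedy, "中国人" is taken first and "人民" is lost, leaving the stray token "民".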
Keywords/Search Tags: full-text retrieval, Chinese word segmentation, formatted documents, Lucene