Research And Application Of Open-source Full-text Search Engine

Posted on:2009-02-10

Degree:Master

Type:Thesis

Country:China

Candidate:J P Cui

Full Text:PDF

GTID:2178360272474595

Subject:Computer system architecture

Abstract/Summary:

With the spread of the Internet and the rapid development of information technology, information retrieval, which aims to find the needed information in massive data quickly and accurately, has become a focus of attention. Full-text retrieval is a particularly efficient retrieval technique: it greatly improves the efficiency of finding specific information in huge volumes of data. Lucene, a member project of the open-source Apache Jakarta organization, is a high-performance, scalable information retrieval library implemented in Java. It can be integrated into application systems quickly and easily to add indexing and search functions.

This paper first introduces the current status and key technologies of full-text retrieval, and then analyzes the structure of Lucene and the principle of its Analyzer. Based on this analysis of the analyzer core in Lucene, I designed a thesaurus (word dictionary) implemented on top of HashMap, and, based on the thesaurus, a forward maximum matching word segmentation method. The segmentation algorithm can segment Chinese, English and numbers, and it also filters out punctuation. Test results show that, compared with the word segmentation methods provided by the Lucene core package and its extensions, including bigram ("dual") segmentation, the resulting segmentation module, MMCAnalyzer, has the advantages of high efficiency and high segmentation accuracy.

In addition, so that it can be embedded in application systems easily and seamlessly, the core of Lucene was kept compact and handles plain-text documents only. However, the share of plain-text documents is gradually shrinking while documents in various other formats are becoming more common. To solve this problem, this paper designs a framework that can handle a variety of document types. The framework can effectively process txt, doc, PDF and other common document types.

Finally, this paper summarizes the research work and points out directions for further study.

Keywords/Search Tags:

Full-text retrieval, Lucene, Analyzer, Information extraction

Related items

1	Application Study Of Lucene Full-text Retrieval On The Network Education Platform
2	Research And Application Of Full Text Retrieval Technology Based On Lucene
3	The Research And Implementation Of Full-text Retrieval System Based On Lucene
4	Research And Application Of Lucene Full-text Retrieval Technology In Patent Information Service Platform
5	Research And Application Of Full-Text Retrieval Based On Lucene.Net
6	Research And Application Of Full-text Retrieval Technology Based On Lucene
7	Development And Maintenance Of Full-text Retrieval Web System Based On Lucene
8	The Design And Implementation Of The Heterogeneous Data Joint Retrieval System
9	The Research On A Lucene-based Full-text Retrieval Model
10	The Research And Implementation Of Full-text Retrieval System Based On Lucene