Research And Application Of Open-source Full-text Search Engine

Posted on: 2009-02-10 | Degree: Master | Type: Thesis
Country: China | Candidate: J P Cui
GTID: 2178360272474595 | Subject: Computer system architecture
Abstract/Summary:
With the spread of the Internet and the rapid development of information technology, information retrieval, which aims to find the needed information in massive data sets quickly and accurately, has become a focus of attention. Full-text retrieval is a particularly efficient retrieval technique and has greatly improved the efficiency of finding specific information in huge volumes of data. Lucene, a member project of the Apache Jakarta open-source organization, is a high-performance, scalable information retrieval library implemented in Java. It can be integrated quickly and easily into application systems to add indexing and search functions.

This paper first introduces the status and key technologies of full-text retrieval, and then analyzes the structure of Lucene and the principles of its Analyzer. Based on this analysis of the analyzer core in Lucene, I designed a thesaurus implemented with a HashMap. On top of the thesaurus I designed a forward maximum matching method for word segmentation. The segmentation algorithm can segment Chinese text, English text, and numbers, and it also filters out punctuation. Test results show that, compared with the word segmentation methods adopted by the Lucene core package and its extensions, including the dual segmentation method, the segmentation module designed here, MMCAnalyzer, has the advantages of high efficiency and high segmentation accuracy.

In addition, so that it can be used in application systems easily and seamlessly, the core of Lucene was designed to be compact and to handle plain-text documents only. However, the share of documents in plain-text format is gradually decreasing, while the share of other formats is increasing. To address this problem, this paper designs a framework that can handle a variety of document types. The framework can effectively process txt, doc, PDF, and other common document types. Finally, this paper summarizes the research work and points out directions for further research.
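
The forward maximum matching segmentation mentioned in the abstract can be illustrated with a minimal Java sketch. This is not the thesis's MMCAnalyzer and does not use the Lucene Analyzer API. The class name `ForwardMaxMatchSegmenter`, its methods, and the choice of a `HashMap<String, Boolean>` as the word list are all assumptions made for illustration; the abstract only states that the thesaurus is based on a HashMap. At each Chinese character, the sketch tries the longest candidate first and shortens it until a dictionary word is found, falling back to a single character. Runs of non-Chinese letters and runs of digits become one token each, and everything else (punctuation, whitespace) is dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical, not from the thesis.
final class ForwardMaxMatchSegmenter {
    // HashMap-backed word list; the Boolean value is just a presence marker (assumption).
    private final Map<String, Boolean> dictionary = new HashMap<>();
    // Length of the longest dictionary word; bounds how far each match attempt looks ahead.
    private int maxWordLength = 1;

    void addWord(String word) {
        if (word.isEmpty()) {
            return;
        }
        dictionary.put(word, Boolean.TRUE);
        maxWordLength = Math.max(maxWordLength, word.length());
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int end;
            if (isHan(c)) {
                // Chinese text: forward maximum matching against the dictionary.
                end = longestDictionaryMatch(text, i);
            } else if (Character.isLetter(c)) {
                // Non-Chinese letters: take the whole run as one token.
                end = i + 1;
                while (end < text.length()
                        && Character.isLetter(text.charAt(end))
                        && !isHan(text.charAt(end))) {
                    end++;
                }
            } else if (Character.isDigit(c)) {
                // Digits: take the whole run as one token.
                end = i + 1;
                while (end < text.length() && Character.isDigit(text.charAt(end))) {
                    end++;
                }
            } else {
                // Punctuation, whitespace, and other symbols are filtered out.
                i++;
                continue;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    // Returns the end index of the longest dictionary word starting at 'start',
    // or start + 1 when no multi-character word matches.
    private int longestDictionaryMatch(String text, int start) {
        int limit = Math.min(text.length(), start + maxWordLength);
        for (int end = limit; end > start + 1; end--) {
            if (dictionary.containsKey(text.substring(start, end))) {
                return end;
            }
        }
        return start + 1;
    }

    // Treats a char as Chinese if its Unicode script is HAN (BMP characters only in this sketch).
    private static boolean isHan(char c) {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }
}
```

With a dictionary containing 全文检索, 中国, 中国人, and 人民, the input "Lucene 2009 全文检索，中国人民!" is split into Lucene, 2009, 全文检索, 中国人, 民. The last two tokens show the greedy nature of forward matching: the longest prefix 中国人 wins even though 中国 + 人民 would be the more natural reading, which is why the quality of the thesaurus matters for accuracy.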
Keywords/Search Tags: Full-text retrieval, Lucene, Analyzer, Information extraction