
The Research Of Full-text Search Engine Key Technology Based On Lucene

Posted on: 2008-07-16; Degree: Master; Type: Thesis
Country: China; Candidate: Z B Wu
GTID: 2178360215995648; Subject: Computer software and theory
Abstract/Summary:
Full-text indexing and retrieval is a fast and effective way to find specific information in a large and complex body of data. Lucene, a project of Apache Jakarta, is a full-text search engine package. A search engine is a fast and effective shortcut to information resources, and network spider technology is the key to a search engine.

This thesis concerns full-text retrieval and the network spider, two leading topics in this research area. Within an intelligent search engine framework, it implements a network spider that roams the Internet, stores page data in a local database, and indexes page content with Lucene, building a full-text search system suitable for modern enterprises and schools.

At present, Lucene can index only plain-text data. However, with the rapid development of networks, plain text is rarely used on its own; common document formats and multimedia documents have increasingly become the main body of information exchanged over the network. This thesis therefore combines a multithreaded network spider with an interface-based method built on the Lucene full-text retrieval system to handle HTML, PDF, Word, Excel, and similar formats, making it possible to index these kinds of documents.
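The interface-based method described above can be illustrated with a minimal Java sketch. It is not the thesis's code: the names TextExtractor, PlainTextExtractor, HtmlTextExtractor, and ExtractorRegistry are hypothetical, and dispatching by file extension is an assumption. Only plain text and a deliberately naive HTML stripper are implemented with the JDK; PDF, Word, and Excel would need third-party parsers registered in the same way. The resulting plain text is what would then be handed to Lucene for indexing.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical interface: turns one document format into plain text for indexing. */
interface TextExtractor {
    String extract(byte[] content);
}

/** Passes plain text through unchanged (assumes UTF-8 input). */
class PlainTextExtractor implements TextExtractor {
    @Override
    public String extract(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

/** Naive HTML handling: drops script/style blocks and tags; entities are not decoded. */
class HtmlTextExtractor implements TextExtractor {
    @Override
    public String extract(byte[] content) {
        String html = new String(content, StandardCharsets.UTF_8);
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // remove script/style bodies
                .replaceAll("(?s)<[^>]+>", " ")                          // remove remaining tags
                .replaceAll("\\s+", " ")                                 // collapse whitespace
                .trim();
    }
}

/** Chooses an extractor by file extension so callers never see format differences. */
class ExtractorRegistry {
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    String toPlainText(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(content);
    }
}
```

A caller would register one extractor per extension, for example "html" and "txt", and then call toPlainText(fileName, bytes) for every downloaded file. Supporting a new format means adding one more TextExtractor implementation, without changing the indexing code.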
The main merit of this approach is that it not only hides the differences between document types from the user as far as possible, but also greatly expands the range of document types that Lucene can process.

The research mainly includes the following. First, I analyzed the working principle of the search engine. Second, I described and analyzed Java multithreading technology, using the open-source project Quartz to perform updates, and in particular the Socket used for connections, the JDBC connection, Java data streams (IO), and Lucene full-text retrieval. Then, online experiments on the campus network, together with data taken from the database, demonstrated the feasibility of the network spider and showed that the system achieved its intended goal.

Finally, this thesis presents a Lucene-based full-text retrieval engine and implements a search engine that can be used in modern enterprises and schools. The system is developed in Java, with Eclipse as the development environment. The database is SQL Server 2000, and the system design draws on various features of the Java language, such as multithreading.
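As a rough illustration of the multithreaded network spider mentioned in the abstract, the Java sketch below visits every URL reachable from a seed exactly once using a fixed thread pool. It is an assumption-laden sketch, not the thesis's implementation: the class name Spider and the fetchLinks function are hypothetical, and fetchLinks stands in for the real work of downloading a page, storing it in the database, and returning its outgoing links.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical single-use spider: create one instance per crawl. */
class Spider {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final AtomicInteger pending = new AtomicInteger();   // tasks submitted but not finished
    private final CountDownLatch done = new CountDownLatch(1);   // released when pending reaches 0
    private final Function<String, List<String>> fetchLinks;     // assumption: fetches a page, returns its links
    private final ExecutorService pool;

    Spider(Function<String, List<String>> fetchLinks, int threads) {
        this.fetchLinks = fetchLinks;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Crawls from the seed and returns the set of visited URLs. */
    Set<String> crawl(String seed) {
        submit(seed);
        try {
            done.await(); // wait until every page task has completed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdownNow();
        }
        return visited;
    }

    private void submit(String url) {
        if (!visited.add(url)) {
            return; // another thread already claimed this URL
        }
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                for (String link : fetchLinks.apply(url)) {
                    submit(link); // children are counted before this task finishes
                }
            } catch (RuntimeException e) {
                // A failed page is skipped; a real spider would log or retry it.
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }
}
```

The concurrent visited set prevents two threads from fetching the same page, and the pending counter lets the caller detect when the crawl has finished even though new work is discovered while it runs. A real spider would also need politeness delays, depth or domain limits, and scheduled re-crawling, none of which are shown here.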
Keywords/Search Tags: Network Spider, Full text retrieval, Multithreaded, common documents, Lucene