
A Lucene-Based Chinese-English Document Search Engine

Posted on: 2009-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: R Zhang
Full Text: PDF
GTID: 2208360245961309
Subject: Computer application technology
Abstract/Summary:
Search engines have become the main way to find information on the Internet, but simple web page search no longer satisfies users, and many specialized search engines have emerged. Word, PPT, and Excel documents hold a large share of the information users need. In response, developers have built dedicated document search engines, such as Tianwang from Peking University. FTP document search engines of this kind are confined to FTP servers and do not index the content of the documents. A few large-scale search engines, such as Baidu and Google, do index document content, but they do not cover FTP servers, which leaves out a significant source of information. As the Internet grows, the full result set of a general search becomes very large, and users need more specific, customized search engines. Flexible and easily configurable search is therefore receiving increasing attention.

The Chinese-English full-text document search engine implemented by the author differs from existing search engines: it simplifies and integrates the massive volume of Internet data. Because of this simplification and integration, the system can be applied to specific areas, such as document search for a particular website or, combined with vertical search technology, document search in a particular field. First, the system makes up for the lack of information in existing document search engines. Second, because it is simple, flexible, and configurable, it can easily be applied to document search scenarios with special requirements, such as running on low-end hardware.

This thesis focuses on the following aspects of the author's implementation:
1. System design, and the work done to optimize performance and extensibility.
2. HTTP spider and FTP spider: design and implementation of spiders for specific document types (Word, PPT, Excel). The thesis describes the system structure, workflow, crucial components (DNS cache), URL duplicate-avoidance policy, politeness policy, HTML page handler, and the design and implementation of the document fetch module; the system design, performance optimization design, and workflow of the FTP spider; and the method for avoiding duplicate documents together with the interface design and implementation of the document analysis module.
3. Design and implementation of a document text extractor based on Apache POI, covering the module design, implementation, and optimization policy.
4. Search and UI modules based on Apache Lucene, covering the principles of Lucene, the design of the search system, and the design of the UI module and the web technologies it uses.

The last chapter presents the results of functional and performance tests and discusses future work, including optimization options for the current system.
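
The abstract names a URL duplicate-avoidance policy and a politeness policy for the spider but does not say how either is implemented. The following is only a minimal sketch of one common way to do both, using the Java standard library alone. The class `CrawlFrontier`, its methods `normalize`, `markSeen`, and `reserveSlot`, and the fixed per-host delay are all assumptions made for illustration; they are not taken from the thesis.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: class and method names are illustrative, not from the thesis.
final class CrawlFrontier {
    private final Set<String> seen = new HashSet<>();               // canonical URLs already queued
    private final Map<String, Long> nextFetchAt = new HashMap<>();  // host -> earliest next fetch time (ms)
    private final long delayMillis;                                 // assumed fixed gap between hits on one host

    CrawlFrontier(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Canonical form: lower-case scheme and host, dot segments resolved,
    // default port and fragment dropped, empty path replaced by "/".
    static String normalize(String url) {
        try {
            URI u = new URI(url.trim()).normalize();
            if (u.getScheme() == null || u.getHost() == null) {
                throw new IllegalArgumentException("absolute URL with host required: " + url);
            }
            String scheme = u.getScheme().toLowerCase(Locale.ROOT);
            String host = u.getHost().toLowerCase(Locale.ROOT);
            int port = u.getPort();
            boolean defaultPort = (scheme.equals("http") && port == 80)
                    || (scheme.equals("https") && port == 443)
                    || (scheme.equals("ftp") && port == 21);
            if (defaultPort) {
                port = -1; // -1 means "no explicit port" for java.net.URI
            }
            String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
            return new URI(scheme, null, host, port, path, u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("malformed URL: " + url, e);
        }
    }

    // Returns true only the first time a URL (in canonical form) is offered.
    boolean markSeen(String url) {
        return seen.add(normalize(url));
    }

    // Politeness: reserves the next fetch slot for the host and returns
    // how many milliseconds the caller should wait before fetching.
    long reserveSlot(String host, long nowMillis) {
        long earliest = Math.max(nowMillis, nextFetchAt.getOrDefault(host, 0L));
        nextFetchAt.put(host, earliest + delayMillis);
        return earliest - nowMillis;
    }
}
```

In this sketch, two spellings of the same address, such as `http://example.com/a.doc` and `http://EXAMPLE.com:80/a.doc#top`, map to one canonical string, so the second call to `markSeen` returns false. With a delay of 1000 ms, a second request to the same host made 200 ms after the first is told to wait 800 ms, while a request to a different host may proceed at once. A real spider would also need thread safety and a bounded or disk-backed seen set, which are left out here.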
Keywords/Search Tags: Lucene, Chinese-English document, full-text search engine, document extraction