With the rapid development of Internet technology, the information resources available online are growing at an astonishing rate. To support their information infrastructure, many enterprises and institutions have set up their own websites to provide information services to users and to improve their visibility and service quality. Over time, many of these websites have accumulated a large number of pages. However, few of them have their own intranet search engine, so users may be unable to find the information they are interested in quickly.

Some web search engines, such as Google, Yahoo, and Baidu, offer an intranet (site-level) information retrieval function. However, when a website relies on a web search engine as its intranet search engine, the search results are of poor quality: they are inaccurate and do not surface new information in a timely manner. The reasons can be summarized as follows: first, web search engines index no more than one-third of the pages on the Internet; second, web search engines usually refresh their indexed pages on a fixed schedule. Therefore, for websites that do not provide an information retrieval system, building their own has become an urgent matter.

Based on an in-depth study of search engine systems, full-text retrieval technology, and Lucene, this thesis designs and implements an intranet search engine system based on Lucene. It presents in detail the requirements analysis, the system architecture, the development tools used for implementation, and the design and implementation of each sub-module. The system has four functional modules: an information gathering module, an indexing module, a searching module, and a human-computer interaction interface. As a key component of the search engine, the information gathering module collects documents for indexing from a designated website.
A web crawler designed for information gathering traverses the website according to certain access rules and downloads the pages to the local server. To some extent, the performance of the web crawler determines the search effectiveness of a search engine system. This thesis introduces a multi-threaded web crawler for information gathering. Indexing is equally important for the search engine: the quality of the index determines the quality of the search results and the effectiveness of the search engine. This thesis designs an indexing framework based on Lucene that can index many document formats, such as HTML, Word, Excel, and PowerPoint, and describes the processes of document parsing and indexing. Once the index has been built, the searching module provides search services to users. Following the design principle of being clean and easy to use, we designed the search input interface and the search results display interface.

To ease further development of the system, we adhere to object-oriented principles in system design and to good coding conventions in system development. Experiments show that the system achieves good indexing and retrieval efficiency and can effectively provide intranet information retrieval services to users.
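
As an illustration of the information gathering step described above, the following is a minimal sketch of a multi-threaded, single-site crawler. It is not the implementation presented in this thesis, which is built around Lucene, a Java library; it is written in Python for brevity. The names `crawl_site`, `extract_links`, and `LinkParser`, the `fetch` callback, and the in-memory `SITE` dictionary are all hypothetical. The only access rules assumed here are "stay on the designated host" and "visit each URL once".

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs linked from the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl_site(start_url, fetch, max_workers=4, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the page's HTML as a string, or None on failure.
    Returns a dict mapping each downloaded URL to its HTML.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    pages = {}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier and len(pages) < max_pages:
            room = max_pages - len(pages)
            batch, frontier = frontier[:room], frontier[room:]
            # Download the batch in parallel; map() preserves input order.
            results = list(pool.map(fetch, batch))
            for url, html in zip(batch, results):
                if html is None:
                    continue
                pages[url] = html
                for link in extract_links(url, html):
                    # Access rule: stay on the designated site, visit once.
                    if urlparse(link).netloc == host and link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return pages


# Usage with an in-memory stand-in for the network (hypothetical pages).
SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="http://other.org/">X</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="b.html#top">B</a>',
    "http://example.org/b.html": "<p>leaf page</p>",
}
pages = crawl_site("http://example.org/", SITE.get)
print(sorted(pages))
```

Passing `fetch` as a parameter keeps the traversal logic separate from the download logic, so the same loop can be exercised without network access, as in the usage lines above. A real crawler would also need politeness delays, robots.txt handling, and error handling, all of which are omitted here.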