| With the growing popularity of the Internet, rich and varied network resources have brought great convenience to people's lives, work, and study. However, the sheer volume of information, its lack of organization, and the interference of junk content make it hard for people to take full advantage of these resources. Search engines emerged to help network users find the information they need conveniently. In general, a search engine consists of a web crawler, index construction, retrieval, and a user interface. The Centaurea system provides a toolkit for full-text retrieval and implements three of these parts: index construction, retrieval, and the user interface. This article describes the design and implementation of Centaurea. First, we analyze the distributed file system, which is the foundation of a large-scale search engine. Then we design the file format of the inverted index, discuss an integer compression scheme called ByteCode, and compare it with other compression methods. We also describe the algorithm and implementation of index construction. Because efficiency is the most important concern, we discuss string-to-integer conversion and file cache management as ways to speed up index construction. Solutions for distributed index construction and distributed retrieval are also discussed. Retrieval efficiency is an important indicator of search engine quality. We describe the design and implementation of the retrieval module and analyze the main factor that affects retrieval efficiency. The scoring method is also discussed, and the vector space model is introduced. The paper is organized by module, covering the inverted file, index construction, query, scoring, and so on. The last chapter gives an overall evaluation. |
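
The abstract does not specify the byte layout of ByteCode, so the following is only an illustrative sketch of byte-aligned integer compression in general, using the well-known variable-byte scheme (7 data bits per byte plus a continuation bit) applied to the gaps between sorted document IDs in a posting list. The function names `vbyte_encode`, `vbyte_decode_all`, `encode_postings`, and `decode_postings` are hypothetical and are not taken from Centaurea.

```python
from typing import Iterable, List


def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first.

    The high bit of each byte is a continuation flag: 1 means more bytes follow.
    This is generic variable-byte coding, NOT necessarily the ByteCode format.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more groups follow
        else:
            out.append(low)  # final group, continuation bit clear
            break
    return bytes(out)


def vbyte_decode_all(data: bytes) -> List[int]:
    """Decode a concatenation of variable-byte integers."""
    values: List[int] = []
    current, shift = 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    if shift:
        raise ValueError("truncated input: last integer is incomplete")
    return values


def encode_postings(doc_ids: Iterable[int]) -> bytes:
    """Store sorted document IDs as variable-byte encoded gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        if doc_id < previous:
            raise ValueError("document IDs must be sorted in ascending order")
        out += vbyte_encode(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data: bytes) -> List[int]:
    """Rebuild the document IDs by summing the decoded gaps."""
    doc_ids: List[int] = []
    total = 0
    for gap in vbyte_decode_all(data):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

Storing gaps rather than absolute IDs keeps most values small, so frequent terms with dense posting lists tend to need only one byte per entry under this kind of scheme.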