
Search Engine Crawler's Design, Implementation And Expansion Optimization

Posted on: 2010-03-17  Degree: Master  Type: Thesis
Country: China  Candidate: S Yang  Full Text: PDF
GTID: 2208360275483355  Subject: Computer system architecture
Abstract/Summary:
A search engine is a software system for the Web. It uses a defined strategy to discover and collect information on the Web and, after processing and organizing that information, provides users with a web information service.

This thesis first analyzes the core technologies of a search engine in full: the indexer, the searcher, the spider, website quality assessment algorithms, lexical analysis, Chinese word segmentation, inverted files, and Boolean query theory. Then, on the basis of these core technologies and a lightweight architecture, it designs the three main modules: crawler, indexer, and searcher. The focus is on the implementation of the web page gathering module.

Web page gathering module: in addition to implementing the module's core function, this thesis proposes the following optimizations:
1. Incremental model: only part of the pages are updated to refresh the page set. This method significantly reduces the number of bulk updates and thus improves the freshness of the page set.
2. Distributed strategy: collection is spread from a single node across multiple nodes that can communicate with each other, and a control node is added to coordinate them.
3. Website weight calculation: used to assess the importance of web pages. It uses Google's PageRank algorithm.
4. Extended disk storage: mechanisms such as inheritance and derivation in an object-oriented language are used to support database storage and a fault-tolerant file format.
5. A new website crawl strategy: each web page is given a weight based on a timer strategy when it is crawled for the first time, and this weight decides which page is visited first the next time, which avoids wasting bandwidth on low-performance servers.

Index module: first, the design of Chinese word segmentation is discussed and a word segmentation algorithm is chosen. Then the strategy for building the forward index file is given. Finally, the multi-level inverted index strategy is given.
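
The relationship between the forward index and the inverted index described for the index module can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the toy documents, and the use of whitespace splitting in place of a Chinese word segmenter are all assumptions made for illustration, and the sketch builds a single-level inverted index rather than the multi-level one the thesis describes.

```python
from collections import defaultdict


def build_forward_index(docs, tokenize):
    """Map each document ID to the list of terms it contains, in order."""
    return {doc_id: tokenize(text) for doc_id, text in docs.items()}


def build_inverted_index(forward_index):
    """Invert the forward index: map each term to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical toy corpus. Whitespace splitting stands in for a real
# Chinese word segmentation algorithm (assumption).
docs = {1: "web crawler design", 2: "crawler index", 3: "index search"}
forward = build_forward_index(docs, str.split)
inverted = build_inverted_index(forward)
```

Keeping each posting list sorted by document ID is a common design choice because it lets later query steps merge lists in a single pass.
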

Search module: first, the Boolean query strategy of the searcher is given. Then, how to implement Boolean queries on top of the inverted index file is discussed.
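
As a rough illustration of Boolean queries over an inverted index, the sketch below evaluates AND, OR, and NOT on sorted posting lists. The toy index, the helper names, and the choice of a linear merge for AND are assumptions for illustration only; the abstract does not specify how the thesis implements these operators.

```python
def intersect(a, b):
    """AND: merge two sorted posting lists, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: IDs present in either posting list, sorted."""
    return sorted(set(a) | set(b))


def difference(a, b):
    """AND NOT: IDs in a that do not appear in b, preserving order."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]


# Hypothetical inverted index: term -> sorted list of document IDs.
index = {"crawler": [1, 2, 4], "index": [2, 3, 4], "search": [3, 4]}


def postings(term):
    """Look up a term; an unknown term has an empty posting list."""
    return index.get(term, [])


both = intersect(postings("crawler"), postings("index"))    # crawler AND index
either = union(postings("crawler"), postings("search"))     # crawler OR search
without = difference(postings("index"), postings("search")) # index AND NOT search
```

A compound query such as `(crawler AND index) OR search` can be evaluated by composing these helpers, for example `union(intersect(postings("crawler"), postings("index")), postings("search"))`.
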
Keywords/Search Tags: crawler, indexer, searcher, incremental, distributed