Design Of A Parallel Web Crawling System

Posted on: 2008-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: X F Zhang
GTID: 2178360212974353
Subject: Computer application technology
Abstract/Summary:
The advent of search engines built a bridge between users and the information they want. But with the rapid growth in the number of web pages, search engines can no longer crawl all of them, and even keeping already-downloaded pages up to date has become a major problem. How to crawl the more important pages and update them more efficiently under limited resources has become a focus of modern search engine research.

This thesis introduces the history and architecture of modern search engines and discusses the problems that exist in crawling systems and updating strategies. Based on an analysis of the technologies and strategies currently in use, it presents a new design for a parallel crawling system with high flexibility and extensibility, which improves the efficiency of the system.

The system introduces a heuristic crawling strategy based on page rank and path rank. Web pages are visited in order of their importance, which is determined by these two values, so the quality of the crawled pages improves. The system also introduces an updating strategy based on a Bayesian algorithm, which classifies web pages into different classes according to how frequently they change. Pages are then updated at different intervals according to their class. This strategy can produce high freshness of web pages at lower cost.
Keywords/Search Tags:search engine, web crawler, parallel system, heuristic search, incremental crawling, updating strategy
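
The two strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how page rank and path rank are combined, nor which Bayesian model is used. The weighted sum with `ALPHA`, the class names, the per-visit change probabilities in `CHANGE_PROB`, and the revisit intervals in `REVISIT_DAYS` are all hypothetical values chosen only for illustration.

```python
import heapq

# Hypothetical weight: the abstract does not state how the two ranks are combined.
ALPHA = 0.5


def importance(page_rank, path_rank, alpha=ALPHA):
    """Assumed importance score: a weighted sum of page rank and path rank."""
    return alpha * page_rank + (1.0 - alpha) * path_rank


class Frontier:
    """Crawl frontier that always yields the most important unvisited URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, page_rank, path_rank):
        if url in self._seen:
            return  # never enqueue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        score = importance(page_rank, path_rank)
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Hypothetical change-frequency classes: assumed probability that a page of
# each class has changed between two consecutive visits.
CHANGE_PROB = {"fast": 0.8, "medium": 0.3, "slow": 0.05}
# Hypothetical revisit interval, in days, for each class.
REVISIT_DAYS = {"fast": 1, "medium": 7, "slow": 30}


def classify(observations, prior=None):
    """Return (most probable class, posterior) from changed/unchanged visits.

    `observations` is an iterable of booleans, True meaning the page had
    changed when it was revisited. The posterior is updated with Bayes' rule
    after every observation, starting from a uniform prior by default.
    """
    classes = list(CHANGE_PROB)
    post = dict(prior) if prior else {c: 1.0 / len(classes) for c in classes}
    for changed in observations:
        for c in classes:
            p = CHANGE_PROB[c]
            post[c] *= p if changed else (1.0 - p)
        total = sum(post.values())
        post = {c: v / total for c, v in post.items()}
    best = max(post, key=post.get)
    return best, post
```

A crawler built this way would pop URLs from `Frontier` to decide what to download next, and after each revisit would call `classify` on the page's change history and schedule the next visit `REVISIT_DAYS[best]` days later. Pages that rarely change are then fetched less often, which is how such a scheme could keep freshness high at lower cost.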