Design Of A Parallel Web Crawling System

Posted on:2008-05-04

Degree:Master

Type:Thesis

Country:China

Candidate:X F Zhang

GTID:2178360212974353

Subject:Computer application technology

Abstract/Summary:

The advent of search engines built a bridge between users and the information they want. But as the number of web pages grows rapidly, search engines can no longer crawl all of them, and even keeping already downloaded pages up to date has become a major problem. How to crawl the more important pages and refresh them efficiently under limited resources has become a focus of modern search engine research.

This paper introduces the history and architecture of modern search engines and discusses the problems that exist in crawling systems and updating strategies. Based on an analysis of the technologies and strategies currently in use, it proposes a new design for a parallel crawling system with high flexibility and extensibility, which improves the efficiency of the system.

The system introduces a heuristic crawling strategy based on page rank and path rank. Pages are visited in order of importance, as determined by these two values, which improves the quality of the crawled pages. The system also introduces an updating strategy based on a Bayesian algorithm, which classifies web pages into classes according to how frequently they change. Pages are then updated at different intervals according to their class. This strategy yields high freshness of web pages at lower cost.
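
The two strategies in the abstract can be illustrated with a minimal sketch. The abstract does not say how page rank and path rank are combined, nor which Bayesian model is used, so everything below is an assumption made for illustration: the linear mixing weight, the `Frontier` class, the Beta-style prior in `change_probability`, the class names, the thresholds, and the revisit intervals are all hypothetical and are not taken from the thesis.

```python
import heapq


class Frontier:
    """URL queue that returns the highest-scoring URL first (hypothetical design)."""

    def __init__(self, weight=0.5):
        # Assumed mixing weight between page rank and path rank.
        self.weight = weight
        self._heap = []
        self._seen = set()

    def push(self, url, page_rank, path_rank):
        # Skip URLs that were already queued.
        if url in self._seen:
            return
        self._seen.add(url)
        # Assumption: importance is a weighted sum of the two values.
        score = self.weight * page_rank + (1 - self.weight) * path_rank
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def change_probability(changes, visits, prior_changes=1.0, prior_visits=2.0):
    """Posterior mean of the per-visit change probability.

    Assumption: a Beta prior, expressed as pseudo-counts, updated with the
    number of visits on which the page was observed to have changed.
    """
    return (changes + prior_changes) / (visits + prior_visits)


# Hypothetical classes and revisit intervals, in hours.
REVISIT_HOURS = {"hot": 6, "warm": 48, "cold": 336}


def classify(changes, visits):
    """Assign a page to a change-frequency class using assumed thresholds."""
    p = change_probability(changes, visits)
    if p >= 0.6:
        return "hot"
    if p >= 0.2:
        return "warm"
    return "cold"


if __name__ == "__main__":
    frontier = Frontier()
    frontier.push("http://example.com/a", page_rank=0.9, path_rank=0.1)
    frontier.push("http://example.com/b", page_rank=0.2, path_rank=0.2)
    frontier.push("http://example.com/c", page_rank=0.8, path_rank=0.9)
    while len(frontier):
        print(frontier.pop())

    page_class = classify(changes=9, visits=10)
    print(page_class, REVISIT_HOURS[page_class])
```

In this sketch a page that changed on 9 of 10 visits falls into the most frequently revisited class, while a page that never changed drifts toward the longest interval. A parallel crawler would additionally need to partition the frontier across workers, which is not shown here.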

Keywords/Search Tags:

search engine, web crawler, parallel system, heuristic search, incremental crawling, updating strategy

Related items

1	The Theme Of The Search Engine Web Spider Search Strategy Study
2	Research On Web Crawler Algorithm Based On Topic Strategy
3	Design And Implementation Of Search Engine System Based On The Incremental Crawler
4	Design And Implementation Of Focused Crawler
5	Research On Web Crawling Technology In Image Search Engine
6	Research And Application Of Vertical Search Engine Key Technologies Based On The Lucene
7	The Design And Implementation Of Topical Search Engine
8	Research And Implementation Of Subject-oriented Dual-bound Web Page Crawling Methods
9	Research And Implementation Of The Strategy-Extensible Search Engine
10	The Design And Research Of Topic Web Crawler In Vertical Search Engine