
Research On Extensible Hash Based Dynamic Load Balancing For Parallel Web Crawling

Posted on: 2011-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: S X Sun
Full Text: PDF
GTID: 2178330338989576
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, as the Internet has developed, the amount of information published on it has grown enormously, and search engines have become a necessary tool in daily life. The web crawler, an automated program that gathers web pages from the Internet, is a critical component of a search engine. As the Internet grows, search engines face greater challenges in both the quantity and the freshness of the pages they gather, so the requirements placed on crawlers are higher than ever. Crawlers must not only fetch more pages with less resource consumption but also be able to crawl under constraints.

This thesis focuses on improving crawler performance and studies the crawler load balancing problem from two aspects: the static distribution of workload and the dynamic repartitioning of workload among web crawlers.

First, after reviewing the related technologies of parallel crawlers, the thesis presents a static workload distribution strategy based on extensible hashing and a logical two-level node mapping. The extensible hashing scheme is discussed and an improved lazy algorithm is proposed, which avoids the continual bucket splitting and hash table updates caused by an unusual distribution of pseudo-key values.

Second, the thesis studies the critical problem of multi-task parallel crawlers: dynamic workload balancing between crawlers. It introduces LW (load weight) to describe the load state of a cluster node more accurately, and then constructs a hypergraph-based model that captures both the communication cost within the application and the migration cost of moving data. Furthermore, it presents an effective multi-level approach that significantly improves the quality of hypergraph repartitioning.

Finally, the thesis describes the design issues of a distributed crawler and the implementation of some of its key modules.
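As an illustration of the static distribution idea, the following is a minimal Python sketch of a textbook extensible-hashing directory that assigns host names to buckets, with each bucket tagged by a crawler node as a stand-in for the two-level mapping (host to logical bucket, logical bucket to physical node). It is not the thesis's implementation: the class names, the MD5-based pseudo key, the bucket capacity, and the rule that a new bucket inherits its parent's node are all assumptions made for this example, and the improved lazy algorithm mentioned above is not reproduced here.

```python
import hashlib


class Bucket:
    """A logical bucket of hosts, tagged with the crawler node that owns it."""

    def __init__(self, local_depth, node):
        self.local_depth = local_depth  # number of low key bits this bucket uses
        self.node = node                # second-level mapping: bucket -> node (assumed)
        self.hosts = set()


class ExtendibleDirectory:
    """Textbook extensible hashing; all names here are hypothetical."""

    def __init__(self, bucket_capacity=2, initial_node="crawler-0"):
        self.global_depth = 0
        self.capacity = bucket_capacity
        self.slots = [Bucket(0, initial_node)]  # directory of 2**global_depth slots

    @staticmethod
    def pseudo_key(host):
        # Pseudo key: the first 64 bits of the MD5 digest (an assumed choice).
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def bucket_for(self, host):
        # First-level mapping: the low global_depth bits select a directory slot.
        index = self.pseudo_key(host) & ((1 << self.global_depth) - 1)
        return self.slots[index]

    def node_for(self, host):
        return self.bucket_for(host).node

    def add(self, host):
        # Split the target bucket until the host fits.
        while True:
            bucket = self.bucket_for(host)
            if host in bucket.hosts or len(bucket.hosts) < self.capacity:
                bucket.hosts.add(host)
                return
            self._split(bucket)

    def _split(self, bucket):
        # Double the directory only when the bucket already uses every index bit.
        if bucket.local_depth == self.global_depth:
            self.slots = self.slots + self.slots
            self.global_depth += 1
        bit = 1 << bucket.local_depth
        bucket.local_depth += 1
        # The sibling inherits the parent's node, so a split moves no work
        # between crawlers until a balancer reassigns it (an assumption).
        sibling = Bucket(bucket.local_depth, bucket.node)
        for i, slot in enumerate(self.slots):
            if slot is bucket and i & bit:
                self.slots[i] = sibling
        moved = {h for h in bucket.hosts if self.pseudo_key(h) & bit}
        bucket.hosts -= moved
        sibling.hosts |= moved
```

A directory built this way can be queried with `node_for("example.com")` after hosts have been added with `add`. In this sketch, rebalancing would amount to changing the `node` attribute of a bucket, which leaves the host-to-bucket mapping untouched.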
Keywords/Search Tags: parallel web crawling, dynamic load balance, extensible hash, hypergraph repartition