We study the Web as a massive, rapidly evolving information resource. In particular, this thesis describes a high-performance architecture and a reliable mechanism for gathering, analyzing, and processing vast numbers of web pages. The main contributions include:

1) Based on an understanding of web pages and their distribution, a scalable architecture for gathering web pages is proposed and studied thoroughly. Combining cluster-based parallel processing technology with the demanding requirements of crawling vast amounts of web information, the architecture strikes a reasonable trade-off among crawling strategy, communication reduction, load balancing, task scheduling, and granularity control. Through a process of design, simulation, and implementation, a system was constructed and put into operation; it demonstrates excellent scalability in the range of 1 to 18 processing nodes and has reached our performance goal of crawling 57 million web pages in 15 days.

2) To address the problem that nodes may occasionally fail during a long crawl, a scheme for dynamic system reconfiguration is proposed. The scheme is based on a two-phase mapping between URLs and processing nodes, which ensures that upon a configuration change (in the number of nodes), the system reaches a new steady state after a short and safe transition period, as sketched below.
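The abstract does not spell out the two-phase mapping, so the following Python sketch illustrates one plausible realization under stated assumptions: phase one hashes a URL's host into a fixed set of virtual buckets, and phase two maps buckets to live nodes. The bucket count and all names (NUM_BUCKETS, build_bucket_table, reassign_failed, the nodeN identifiers) are illustrative assumptions, not the thesis's actual design.

```python
import hashlib

NUM_BUCKETS = 1024  # assumed fixed virtual-bucket count, independent of node count


def url_to_bucket(url: str) -> int:
    """Phase 1: URL -> bucket. Hashing the host keeps a whole site on
    one node, which helps with communication reduction."""
    host = url.split("/")[2] if "://" in url else url
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def build_bucket_table(nodes: list) -> dict:
    """Phase 2: bucket -> node, built once for a given configuration."""
    return {b: nodes[b % len(nodes)] for b in range(NUM_BUCKETS)}


def url_to_node(url: str, table: dict) -> str:
    """Compose the two phases to route a URL to its processing node."""
    return table[url_to_bucket(url)]


def reassign_failed(table: dict, failed: str, live_nodes: list) -> dict:
    """On a node failure, rewrite only the entries that pointed at the
    failed node; phase 1 never changes, so URLs in all other buckets
    stay put and only the remapped buckets migrate during the short
    transition period."""
    for b, node in table.items():
        if node == failed:
            table[b] = live_nodes[b % len(live_nodes)]
    return table


nodes = ["node%d" % i for i in range(18)]
table = build_bucket_table(nodes)
print(url_to_node("http://www.example.com/index.html", table))

nodes.remove("node3")  # simulate a node failure
table = reassign_failed(table, "node3", nodes)
```

The point of the indirection is that the URL hash (phase one) is fixed for the lifetime of the system, while only the small bucket-to-node table (phase two) is rewritten on reconfiguration, so most URLs keep their assigned node and the system can converge to a new steady state quickly.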