The Design And Implementation Of Web Information Retrieval System

Posted on: 2004-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Liu
Full Text: PDF
GTID: 2168360095957191
Subject: Electronic and Information Engineering
Abstract/Summary:
We study the Web as a massive, rapidly evolving information resource. In particular, this thesis describes a high-performance architecture and a reliable mechanism for gathering, analyzing, and processing vast numbers of web pages. The main contributions include:

1) Based on an understanding of web pages and their distribution, a scalable architecture for gathering web pages is proposed and studied thoroughly. Combining cluster-based parallel processing technology with the demanding requirement of crawling a vast amount of web information, this architecture strikes a reasonable trade-off among crawling strategy, communication reduction, load balancing, task scheduling, and granularity control. Through a process of design, simulation, and implementation, a system was constructed and put into operation. It demonstrates excellent scalability in the range of 1 to 18 processing nodes and reached our performance goal: crawling 57 million web pages in 15 days.

2) To address the problem that nodes may occasionally fail during a long crawling process, a scheme for dynamic system reconfiguration is proposed. The scheme is based on a two-phase mapping between URLs and processing nodes, which ensures that when the configuration (the number of nodes) changes, the system reaches a new steady state after a short and safe transition period.
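The abstract does not spell out how the two-phase mapping works, so the following is only a minimal sketch of one common way such a scheme can be built, not the thesis's actual design. It assumes that a URL is first hashed by host name to a fixed logical bucket (phase one), and that a small table then maps buckets to processing nodes (phase two). The bucket count, the hash function, the node names, and all function names here are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

NUM_BUCKETS = 1024  # hypothetical: fixed number of logical buckets


def url_to_bucket(url, num_buckets=NUM_BUCKETS):
    """Phase one: map a URL to a logical bucket by hashing its host name.

    This mapping never changes, whatever the number of nodes is.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def build_table(nodes, num_buckets=NUM_BUCKETS):
    """Phase two: assign buckets to nodes round-robin (assumed policy)."""
    return [nodes[b % len(nodes)] for b in range(num_buckets)]


def reassign_failed(table, failed, survivors):
    """Return a new table in which only the failed node's buckets move."""
    new_table = list(table)
    moved = 0
    for bucket, node in enumerate(table):
        if node == failed:
            new_table[bucket] = survivors[moved % len(survivors)]
            moved += 1
    return new_table


def node_for(url, table):
    """Look up the node responsible for a URL under the given table."""
    return table[url_to_bucket(url, len(table))]


if __name__ == "__main__":
    table = build_table(["n0", "n1", "n2"])
    print(node_for("http://example.com/index.html", table))
    # Simulate the failure of n1: only its buckets are handed over.
    table = reassign_failed(table, "n1", ["n0", "n2"])
    print(node_for("http://example.com/index.html", table))
```

Under these assumptions, a change in the number of nodes touches only the bucket-to-node table. URLs whose buckets belonged to surviving nodes keep their owner, which is one way to keep the transition period short.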
Keywords/Search Tags: World Wide Web, Search Engine, Scalable Web crawling, Dynamic reconfiguration, Load balancing, Web mining