We study the Web as a massive, rapidly evolving information resource. In particular, this thesis describes a high-performance architecture and a reliable mechanism for gathering, analyzing, and processing vast numbers of web pages. The main contributions include:

1) Based on an understanding of web pages and their distribution, a scalable architecture for gathering web pages is proposed and studied thoroughly. Combining cluster-based parallel processing technology with the demanding requirements of crawling vast amounts of web information, the architecture strikes a reasonable trade-off among crawling strategy, communication reduction, load balancing, task scheduling, and granularity control. Through a process of design, simulation, and implementation, a system was constructed and put into operation; it demonstrates excellent scalability in the range of 1 to 18 processing nodes and has reached our performance goal of crawling 57 million web pages in 15 days.

2) To address the problem that nodes may occasionally fail during a long crawl, a scheme for dynamic system reconfiguration is proposed. The scheme is based on a two-phase mapping between URLs and processing nodes, which ensures that upon a configuration change (in the number of nodes), the system reaches a new steady state after a short and safe transition period, as sketched below.
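The abstract does not spell out the two-phase mapping, so the following Python sketch illustrates one plausible realization under stated assumptions: phase one hashes a URL's host into a fixed set of virtual buckets, and phase two maps buckets to live nodes. The bucket count and all names (NUM_BUCKETS, build_bucket_table, reassign_failed, the nodeN identifiers) are illustrative assumptions, not the thesis's actual design.

```python
import hashlib

NUM_BUCKETS = 1024  # assumed fixed virtual-bucket count, independent of node count


def url_to_bucket(url: str) -> int:
    """Phase 1: URL -> bucket. Hashing the host keeps a whole site on
    one node, which helps with communication reduction."""
    host = url.split("/")[2] if "://" in url else url
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def build_bucket_table(nodes: list) -> dict:
    """Phase 2: bucket -> node, built once for a given configuration."""
    return {b: nodes[b % len(nodes)] for b in range(NUM_BUCKETS)}


def url_to_node(url: str, table: dict) -> str:
    """Compose the two phases to route a URL to its processing node."""
    return table[url_to_bucket(url)]


def reassign_failed(table: dict, failed: str, live_nodes: list) -> dict:
    """On a node failure, rewrite only the entries that pointed at the
    failed node; phase 1 never changes, so URLs in all other buckets
    stay put and only the remapped buckets migrate during the short
    transition period."""
    for b, node in table.items():
        if node == failed:
            table[b] = live_nodes[b % len(live_nodes)]
    return table


nodes = ["node%d" % i for i in range(18)]
table = build_bucket_table(nodes)
print(url_to_node("http://www.example.com/index.html", table))

nodes.remove("node3")  # simulate a node failure
table = reassign_failed(table, "node3", nodes)
```

The point of the indirection is that the URL hash (phase one) is fixed for the lifetime of the system, while only the small bucket-to-node table (phase two) is rewritten on reconfiguration, so most URLs keep their assigned node and the system can converge to a new steady state quickly.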