
Research And Implementation Of Distributed Web Crawler

Posted on: 2017-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: M Guo
Full Text: PDF
GTID: 2348330512469375
Subject: Signal and Information Processing
Abstract/Summary:
In the era of big data, a stand-alone crawler is no longer suitable for collecting massive amounts of web information, so research on distributed crawlers has become an inevitable trend. At present, several major internet companies at home and abroad have developed their own large-scale distributed crawlers, but their technical schemes are not publicly disclosed. The open source community also maintains a number of actively updated distributed crawler projects, yet these heavyweight projects tend to suffer from problems such as complex configuration and difficult usage. Motivated by these problems, this thesis aims to develop a lightweight distributed crawler based on Hadoop. The main contents are as follows:

Firstly, this thesis studies a key algorithm in crawler systems, duplicated URL detection, and analyses the merits and drawbacks of the main existing algorithms. To keep the crawler system lightweight, distributed techniques are combined with stand-alone duplicated URL detection, and a distributed duplicated URL detection algorithm based on MapReduce is presented. This algorithm overcomes the inefficiency of stand-alone algorithms on massive data while remaining easy to integrate with a lightweight crawler system, so the URL deduplication module stays low-coupling and high-cohesion and the crawler system runs efficiently.

Secondly, this thesis designs an efficient lightweight distributed crawler using two core components of Hadoop, namely HDFS and MapReduce. A detailed scheme is given, covering the framework, the processing procedure, the distributed functional modules, distributed storage, and so on.

Thirdly, following the design scheme, the system is implemented in Java and tested on Hadoop clusters of different scales, with both function testing and performance testing. The recorded results show that the distributed crawler designed in this thesis is capable of collecting massive web information and offers favourable extensibility.
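To illustrate the general idea behind MapReduce-based URL deduplication of the kind the abstract describes, the following is a minimal sketch using the standard Hadoop Java API. It is not the thesis's actual algorithm; the class names, the one-URL-per-line input format, and the use of the reducer as a combiner are illustrative assumptions.

// Sketch: deduplicate a large list of URLs with one MapReduce job.
// The shuffle groups identical URLs under one key; the reducer emits each key once.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlDedup {

    // Map: emit each URL as the key so duplicates meet in the same reduce group.
    public static class UrlMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            if (!url.isEmpty()) {
                context.write(new Text(url), NullWritable.get());
            }
        }
    }

    // Reduce: each distinct URL forms one group; write it exactly once.
    public static class UrlReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "url dedup");
        job.setJarByClass(UrlDedup.class);
        job.setMapperClass(UrlMapper.class);
        job.setCombinerClass(UrlReducer.class); // drop local duplicates before the shuffle
        job.setReducerClass(UrlReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input: crawled URL lists
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output: unique URLs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job could be run between crawl rounds on the same Hadoop cluster that hosts the crawler, which is one way a deduplication module can stay decoupled from the fetching components.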
Keywords/Search Tags: distributed crawler, duplicated URL detection algorithm, Hadoop