
Research And Implementation Of Distributed Web Crawler

Posted on: 2017-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: M Guo
Full Text: PDF
GTID: 2348330512469375
Subject: Signal and Information Processing
Abstract/Summary:
In the era of big data, a stand-alone crawler is no longer suitable for collecting massive amounts of web information, so research on distributed crawlers has become an inevitable trend. At present, several major internet companies at home and abroad have developed their own large-scale distributed crawlers, but their technical schemes are not publicly disclosed. The open source community also maintains a number of actively updated distributed crawler projects, yet these heavyweight projects tend to suffer from problems such as complex configuration and difficult usage. Motivated by these problems, this thesis aims to develop a lightweight distributed crawler based on Hadoop. The main contents are as follows:

Firstly, this thesis studies a key algorithm in crawler systems, duplicated URL detection, and analyses the merits and drawbacks of the main existing algorithms. To keep the crawler system lightweight, distributed techniques are combined with stand-alone duplicated URL detection, and a distributed duplicated URL detection algorithm based on MapReduce is presented. This algorithm overcomes the inefficiency of stand-alone algorithms on massive data while remaining easy to integrate with a lightweight crawler system, so the URL deduplication module stays low-coupling and high-cohesion and the crawler system runs efficiently.

Secondly, this thesis designs an efficient lightweight distributed crawler using two core components of Hadoop, namely HDFS and MapReduce. A detailed scheme is given, covering the framework, the processing procedure, the distributed functional modules, distributed storage, and so on.

Thirdly, following the design scheme, the system is implemented in Java and tested on Hadoop clusters of different scales, with both function testing and performance testing. The recorded results show that the distributed crawler designed in this thesis is capable of collecting massive web information and offers favourable extensibility.
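To illustrate the general idea behind MapReduce-based URL deduplication of the kind the abstract describes, the following is a minimal sketch using the standard Hadoop Java API. It is not the thesis's actual algorithm; the class names, the one-URL-per-line input format, and the use of the reducer as a combiner are illustrative assumptions.

// Sketch: deduplicate a large list of URLs with one MapReduce job.
// The shuffle groups identical URLs under one key; the reducer emits each key once.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlDedup {

    // Map: emit each URL as the key so duplicates meet in the same reduce group.
    public static class UrlMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            if (!url.isEmpty()) {
                context.write(new Text(url), NullWritable.get());
            }
        }
    }

    // Reduce: each distinct URL forms one group; write it exactly once.
    public static class UrlReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "url dedup");
        job.setJarByClass(UrlDedup.class);
        job.setMapperClass(UrlMapper.class);
        job.setCombinerClass(UrlReducer.class); // drop local duplicates before the shuffle
        job.setReducerClass(UrlReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input: crawled URL lists
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output: unique URLs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job could be run between crawl rounds on the same Hadoop cluster that hosts the crawler, which is one way a deduplication module can stay decoupled from the fetching components.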
Keywords/Search Tags: distributed crawler, duplicated URL detection algorithm, Hadoop