Study on a Hadoop-Based Distributed Web Crawler

Posted on: 2016-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Yue
Full Text: PDF
GTID: 2298330467991019
Subject: Computer Science and Technology
Abstract/Summary:
With the development of network technology, the number of websites on the Internet keeps growing, and a simple single-machine web crawler can no longer store the large volumes of data produced by some large websites, so distributed storage technology has to be applied. Hadoop is a software framework that provides the Hadoop Distributed File System (HDFS) and MapReduce, and it therefore plays an important role in building a web crawler on a distributed platform.

In this thesis, we first analyze HDFS and web crawler techniques. We then modify the algorithm for computing URL weights and establish the general framework of the distributed web crawler. Finally, we design and implement each module of the crawler. The main techniques are as follows:

(1) The traditional URL weighting algorithm considers only the directory depth and the importance of a web page. The improved algorithm additionally takes the importance of the page content into account, which improves the precision of the URL weights (a sketch of such a scoring function is given after this abstract).

(2) During crawling, URLs must be resolved frequently, which places a heavy load on DNS servers. This thesis applies DNS caching: when URLs under the same host are resolved within a short period, results that have already been resolved and preserved in the cache can be reused directly (see the cache sketch below).

(3) To avoid fetching duplicate links during crawling, a Bloom filter is applied to the URLs to eliminate repeats (sketched below). In the update module, a page update algorithm is designed: when a web page changes, its URL is added back into the unvisited URL queue.

Based on the Hadoop distributed framework, we test the crawler's performance with respect to the number of threads and nodes and analyze the results. The improved design achieves higher crawling efficiency than the traditional distributed web crawler.
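As an illustration of point (1), the improved URL weight can be expressed as a linear combination of the three factors. This is only a sketch: the class, method, and coefficient names (UrlWeight, score, ALPHA, BETA, GAMMA) and the coefficient values are assumptions for illustration, not identifiers or values taken from the thesis.

    // Hypothetical sketch of the improved URL weighting: directory depth,
    // page (link) importance, and the newly added content importance are
    // combined into a single priority score. Coefficient values are placeholders.
    public final class UrlWeight {

        private static final double ALPHA = 0.3; // weight of directory depth
        private static final double BETA  = 0.3; // weight of page importance
        private static final double GAMMA = 0.4; // weight of content importance (new factor)

        /**
         * Combines the three factors into one score. Shallower URLs (smaller
         * directory depth) should rank higher, so the depth is inverted.
         */
        public static double score(int directoryDepth,
                                   double pageImportance,
                                   double contentImportance) {
            double depthScore = 1.0 / (1 + directoryDepth);
            return ALPHA * depthScore + BETA * pageImportance + GAMMA * contentImportance;
        }
    }

A crawler would compute this score when a new URL is extracted and use it to order the unvisited URL queue.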
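For point (2), a minimal sketch of a host-level DNS cache follows, assuming a fixed time-to-live; the thesis does not specify the cache structure or TTL, so the class name and the five-minute value are assumptions.

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;

    // Minimal DNS cache sketch: resolved addresses are kept per host so that
    // URLs under the same host resolved within the TTL skip the DNS round trip.
    public final class DnsCache {

        private static final long TTL_MILLIS = 5 * 60 * 1000; // assumed 5-minute TTL

        private record Entry(InetAddress address, long resolvedAt) {}

        private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();

        /** Returns the cached address for a host, resolving and caching it on a miss. */
        public InetAddress resolve(String host) throws UnknownHostException {
            Entry entry = cache.get(host);
            long now = System.currentTimeMillis();
            if (entry != null && now - entry.resolvedAt() < TTL_MILLIS) {
                return entry.address();                        // cache hit: reuse the result
            }
            InetAddress address = InetAddress.getByName(host); // cache miss: real DNS lookup
            cache.put(host, new Entry(address, now));
            return address;
        }
    }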
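For point (3), the sketch below shows Bloom-filter URL de-duplication; the bit-array size, number of hash functions, and hash scheme are assumptions chosen for illustration, not parameters reported in the thesis.

    import java.util.BitSet;

    // Bloom-filter sketch for URL de-duplication: each URL sets k bits, and a URL
    // is treated as a duplicate only if all k of its bits are already set.
    public final class UrlBloomFilter {

        private static final int NUM_BITS   = 1 << 24; // assumed bit-array size (~16M bits)
        private static final int NUM_HASHES = 4;       // assumed number of hash functions

        private final BitSet bits = new BitSet(NUM_BITS);

        /** Records the URL and returns true only if it was definitely not seen before. */
        public synchronized boolean addIfNew(String url) {
            boolean possiblySeen = true;
            for (int i = 0; i < NUM_HASHES; i++) {
                int index = indexFor(url, i);
                if (!bits.get(index)) {
                    possiblySeen = false;
                    bits.set(index);
                }
            }
            return !possiblySeen;
        }

        // Derives the i-th bit index by salting the URL's hash with the hash number.
        private static int indexFor(String url, int salt) {
            int h = (url.hashCode() * 31 + salt) * 0x9E3779B9;
            return Math.floorMod(h, NUM_BITS);
        }
    }

A Bloom filter can report false positives but never false negatives, so a small fraction of genuinely new URLs may be skipped; this is the usual trade-off accepted in exchange for the filter's very small memory footprint.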
Keywords/Search Tags: Distributed web crawler, web crawling algorithm, MapReduce, HDFS, Hadoop