Study on a Hadoop-Based Distributed Web Crawler

Posted on: 2016-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Yue
Full Text: PDF
GTID: 2298330467991019
Subject: Computer Science and Technology
Abstract/Summary:
With the development of network technology, the number of websites on the Internet keeps growing, and a simple single-machine web crawler can no longer store the large volumes of data produced by some large websites, so distributed storage technology has to be applied. Hadoop is a software framework that provides the Hadoop Distributed File System (HDFS) and MapReduce, and it therefore plays an important role in building a web crawler on a distributed platform.

In this thesis, we first analyze HDFS and web crawler techniques. We then modify the algorithm for computing URL weights and establish the general framework of the distributed web crawler. Finally, we design and implement each module of the crawler. The main techniques are as follows:

(1) The traditional URL weighting algorithm considers only the directory depth and the importance of a web page. The improved algorithm additionally takes the importance of the page content into account, which improves the precision of the URL weights (a sketch of such a scoring function is given after this abstract).

(2) During crawling, URLs must be resolved frequently, which places a heavy load on DNS servers. This thesis applies DNS caching: when URLs under the same host are resolved within a short period, results that have already been resolved and preserved in the cache can be reused directly (see the cache sketch below).

(3) To avoid fetching duplicate links during crawling, a Bloom filter is applied to the URLs to eliminate repeats (sketched below). In the update module, a page update algorithm is designed: when a web page changes, its URL is added back into the unvisited URL queue.

Based on the Hadoop distributed framework, we test the crawler's performance with respect to the number of threads and nodes and analyze the results. The improved design achieves higher crawling efficiency than the traditional distributed web crawler.
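As an illustration of point (1), the improved URL weight can be expressed as a linear combination of the three factors. This is only a sketch: the class, method, and coefficient names (UrlWeight, score, ALPHA, BETA, GAMMA) and the coefficient values are assumptions for illustration, not identifiers or values taken from the thesis.

    // Hypothetical sketch of the improved URL weighting: directory depth,
    // page (link) importance, and the newly added content importance are
    // combined into a single priority score. Coefficient values are placeholders.
    public final class UrlWeight {

        private static final double ALPHA = 0.3; // weight of directory depth
        private static final double BETA  = 0.3; // weight of page importance
        private static final double GAMMA = 0.4; // weight of content importance (new factor)

        /**
         * Combines the three factors into one score. Shallower URLs (smaller
         * directory depth) should rank higher, so the depth is inverted.
         */
        public static double score(int directoryDepth,
                                   double pageImportance,
                                   double contentImportance) {
            double depthScore = 1.0 / (1 + directoryDepth);
            return ALPHA * depthScore + BETA * pageImportance + GAMMA * contentImportance;
        }
    }

A crawler would compute this score when a new URL is extracted and use it to order the unvisited URL queue.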
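For point (2), a minimal sketch of a host-level DNS cache follows, assuming a fixed time-to-live; the thesis does not specify the cache structure or TTL, so the class name and the five-minute value are assumptions.

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;

    // Minimal DNS cache sketch: resolved addresses are kept per host so that
    // URLs under the same host resolved within the TTL skip the DNS round trip.
    public final class DnsCache {

        private static final long TTL_MILLIS = 5 * 60 * 1000; // assumed 5-minute TTL

        private record Entry(InetAddress address, long resolvedAt) {}

        private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();

        /** Returns the cached address for a host, resolving and caching it on a miss. */
        public InetAddress resolve(String host) throws UnknownHostException {
            Entry entry = cache.get(host);
            long now = System.currentTimeMillis();
            if (entry != null && now - entry.resolvedAt() < TTL_MILLIS) {
                return entry.address();                        // cache hit: reuse the result
            }
            InetAddress address = InetAddress.getByName(host); // cache miss: real DNS lookup
            cache.put(host, new Entry(address, now));
            return address;
        }
    }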
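For point (3), the sketch below shows Bloom-filter URL de-duplication; the bit-array size, number of hash functions, and hash scheme are assumptions chosen for illustration, not parameters reported in the thesis.

    import java.util.BitSet;

    // Bloom-filter sketch for URL de-duplication: each URL sets k bits, and a URL
    // is treated as a duplicate only if all k of its bits are already set.
    public final class UrlBloomFilter {

        private static final int NUM_BITS   = 1 << 24; // assumed bit-array size (~16M bits)
        private static final int NUM_HASHES = 4;       // assumed number of hash functions

        private final BitSet bits = new BitSet(NUM_BITS);

        /** Records the URL and returns true only if it was definitely not seen before. */
        public synchronized boolean addIfNew(String url) {
            boolean possiblySeen = true;
            for (int i = 0; i < NUM_HASHES; i++) {
                int index = indexFor(url, i);
                if (!bits.get(index)) {
                    possiblySeen = false;
                    bits.set(index);
                }
            }
            return !possiblySeen;
        }

        // Derives the i-th bit index by salting the URL's hash with the hash number.
        private static int indexFor(String url, int salt) {
            int h = (url.hashCode() * 31 + salt) * 0x9E3779B9;
            return Math.floorMod(h, NUM_BITS);
        }
    }

A Bloom filter can report false positives but never false negatives, so a small fraction of genuinely new URLs may be skipped; this is the usual trade-off accepted in exchange for the filter's very small memory footprint.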
Keywords/Search Tags: Distributed web crawler, web crawling algorithm, MapReduce, HDFS, Hadoop