
Research of Distributed Web Crawler Based on Hadoop

Posted on: 2017-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: S Li
Full Text: PDF
GTID: 2348330485985017
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the Internet and the rapid growth of web page information, it is very difficult for a stand-alone web crawler to capture and process the complex and huge amounts of data on the Internet, due to the limits of a single machine's computing power and storage capacity. The Hadoop distributed computing platform, developed by the Apache Foundation, offers high availability, scalability, and extensibility for processing and storing huge amounts of data, which has quickly made Hadoop a popular choice in the field of mass data processing. Combining web crawler technology with the Hadoop distributed computing platform effectively solves the problems of capturing, storing, and analyzing massive web data, so the distributed web crawler based on Hadoop has important research value and significance. This thesis studies and analyzes two algorithms of the web crawler, the link analysis algorithm and the duplicate URL removal algorithm, and improves their performance in the Hadoop environment.

PageRank is the algorithm Google uses to measure the importance of web pages. In a big-data environment, the Hadoop-based PageRank algorithm produces a record in the Map-side spill files for every outlink URL, so the output of the Map function becomes very large. These intermediate results are transmitted over the network to the Reduce side, so too much time is spent on network transfer and computational efficiency is low. To solve this problem, the fourth chapter improves the Hadoop-based PageRank algorithm. According to the characteristics of the Web graph formed by URL links, the graph is divided into subgraphs, and more of the computation is concentrated in the Map phase. This reduces the size of the Map function's output, and thereby reduces the intermediate network transmission time of MapReduce and improves the efficiency of the algorithm.

Duplicate URL removal is another important algorithm of the web crawler. It filters out duplicate URLs before they are added to the URL queue, which improves the crawler's performance. In the fifth chapter, the Bloom Filter algorithm is studied for removing duplicate URLs in the web crawler. The Bloom Filter is a highly space-efficient data structure whose insert and query operations have low time complexity, and whose space consumption is independent of the size of the elements stored in it. However, as elements are added, the false positive rate rises noticeably, so useful URLs are wrongly discarded and the crawler's performance suffers. To solve this problem, the fifth chapter proposes an improved algorithm based on a dynamic master-slave Bloom Filter structure: the improved structure reports a false positive only when both filters produce a false positive at the same time, and as the number of elements grows, it slows the growth of the false positive rate by increasing the number of filters. Finally, both the standard Bloom Filter algorithm and the improved algorithm are implemented on Hadoop, and the false positive rate is reduced.
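To make the MapReduce structure of the PageRank computation concrete, the following is a minimal sketch of one PageRank iteration as a Hadoop job, written in Java. The input format (pageId, rank, comma-separated outlinks per line), the class names, and the use of a Combiner to pre-aggregate rank contributions on the Map side are illustrative assumptions; the Combiner only stands in for the general idea of shifting work ahead of the shuffle and does not reproduce the subgraph partitioning proposed in the fourth chapter.

// One PageRank iteration on Hadoop (sketch).
// Assumed input line format: pageId <TAB> rank <TAB> comma-separated outlinks
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankIteration {
    private static final double DAMPING = 0.85;

    public static class PRMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String page = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String[] outlinks = (parts.length > 2 && !parts[2].isEmpty())
                    ? parts[2].split(",") : new String[0];
            // Re-emit the graph structure so the reducer can rebuild the adjacency list.
            ctx.write(new Text(page), new Text("|" + (parts.length > 2 ? parts[2] : "")));
            // Distribute this page's rank evenly over its outlinks.
            for (String target : outlinks) {
                ctx.write(new Text(target), new Text("#" + (rank / outlinks.length)));
            }
        }
    }

    // Combiner: pre-sums contribution records ("#...") per key on the map side,
    // shrinking the intermediate data shipped to the reducers.
    public static class PRCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double partial = 0.0;
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("#")) {
                    partial += Double.parseDouble(s.substring(1));
                } else {
                    ctx.write(key, v);   // pass the structure record through unchanged
                }
            }
            ctx.write(key, new Text("#" + partial));
        }
    }

    public static class PRReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            String links = "";
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("#")) sum += Double.parseDouble(s.substring(1));
                else links = s.substring(1);
            }
            double newRank = (1.0 - DAMPING) + DAMPING * sum;  // simplified damping term
            ctx.write(key, new Text(newRank + "\t" + links));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pagerank-iteration");
        job.setJarByClass(PageRankIteration.class);
        job.setMapperClass(PRMapper.class);
        job.setCombinerClass(PRCombiner.class);
        job.setReducerClass(PRReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In a full crawler this job would be run repeatedly until the ranks converge, with each iteration's output serving as the next iteration's input.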
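The following is a minimal sketch, in Java, of how a pair of Bloom filters with independent hash seeds can be used for URL de-duplication, loosely following the master-slave idea described above: a URL is treated as a duplicate only when both filters claim to contain it. The class names (MasterSlaveUrlFilter, BloomFilter), the FNV-style seeded hash, and the fixed filter sizes are illustrative assumptions; the dynamic growth of the filter chain and the Hadoop integration from the fifth chapter are not reproduced here.

// URL de-duplication with two Bloom filters (sketch).
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

public class MasterSlaveUrlFilter {

    /** Plain Bloom filter with k seeded hash functions over an m-bit array. */
    static class BloomFilter {
        private final BitSet bits;
        private final int m;          // number of bits
        private final int k;          // number of hash functions
        private final int seedBase;   // makes the two filters hash independently

        BloomFilter(int m, int k, int seedBase) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
            this.seedBase = seedBase;
        }

        void add(String url) {
            for (int i = 0; i < k; i++) bits.set(index(url, i));
        }

        boolean mightContain(String url) {
            for (int i = 0; i < k; i++) {
                if (!bits.get(index(url, i))) return false;
            }
            return true;  // possibly a false positive
        }

        // Simple seeded FNV-1a style hash; good enough for a sketch.
        private int index(String url, int i) {
            int h = 0x811C9DC5 ^ (seedBase + i * 0x9E3779B9);
            for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
                h ^= (b & 0xFF);
                h *= 0x01000193;
            }
            return Math.floorMod(h, m);
        }
    }

    private final BloomFilter master;
    private final BloomFilter slave;

    public MasterSlaveUrlFilter(int bitsPerFilter, int hashes) {
        this.master = new BloomFilter(bitsPerFilter, hashes, 1);
        this.slave  = new BloomFilter(bitsPerFilter, hashes, 2);
    }

    /** Returns true if the URL has not been seen before (and records it). */
    public boolean addIfNew(String url) {
        // Report a duplicate only if BOTH filters claim membership.
        if (master.mightContain(url) && slave.mightContain(url)) {
            return false;
        }
        master.add(url);
        slave.add(url);
        return true;
    }

    public static void main(String[] args) {
        MasterSlaveUrlFilter filter = new MasterSlaveUrlFilter(1 << 20, 5);
        System.out.println(filter.addIfNew("http://example.com/a"));  // true  (new)
        System.out.println(filter.addIfNew("http://example.com/a"));  // false (duplicate)
        System.out.println(filter.addIfNew("http://example.com/b"));  // true  (new)
    }
}

Assuming the two filters hash independently, requiring agreement from both drives the combined false positive probability toward the product of their individual rates, at the cost of extra memory; the dynamic addition of filters described in the thesis, which this sketch omits, addresses the continued rise of false positives as the URL set keeps growing.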
Keywords/Search Tags: Distributed web crawler, Hadoop, PageRank algorithm, Bloom Filter algorithm