
The Research Of The Bloom Filter In Distributed Crawlers

Posted on: 2018-10-23
Degree: Master
Type: Thesis
Country: China
Candidate: X T Zhang
Full Text: PDF
GTID: 2348330515481991
Subject: Computer application technology
Abstract/Summary:
With the continuous development of network technology, the total amount of information stored on the Internet has grown geometrically. This mass of complex information gives people more sources of information, but it also increases the burden of finding the useful parts. Against this background, cloud computing, with distributed processing as its core technology, has developed rapidly. However, how to eliminate duplicate information effectively during retrieval, and thereby improve retrieval efficiency, has always been a focus of Internet research. Among current duplicate-detection algorithms, the Bloom filter is a relatively mature one. Its principle is to apply multiple hash functions to the source data, compressing it by mapping it into a smaller space, and to use a bit array to store and represent the resulting set compactly. A query needs only a hash-mapping step to determine whether an element belongs to the set, so the filter combines low space usage with high query efficiency and has been applied successfully in a number of areas. The purpose of this paper is to reduce the error rate of the Bloom filter in URL de-duplication for distributed web crawlers, and to give the algorithm some tolerance for overload when the number of web page URLs collected exceeds the element capacity that the Bloom filter was designed to handle. First, starting from the working principle of the Bloom filter, this paper identifies the defects of the Bloom filter when it is applied to de-duplication in a distributed web crawler and analyzes their causes. Second, drawing on several existing improvements to the Bloom filter, it proposes an improved Bloom filter algorithm that is better suited to the URL de-duplication module of a distributed web crawler. Finally, it implements a simple Hadoop-based distributed web crawler tool to verify the effect of the improved algorithm in an actual distributed crawler URL de-duplication application. The analysis of the results observed in that implementation can provide a reference for further improving the application of the Bloom filter to URL de-duplication in distributed web crawlers.
Keywords/Search Tags: Bloom Filter, URL Filter, distributed web crawler, Hadoop
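
The working principle summarized in the abstract (several hash functions mapping each element into a bit array, with a membership query that only re-computes the hashes) can be illustrated with a minimal Python sketch. This is the standard Bloom filter, not the improved algorithm proposed in the thesis, whose details the abstract does not give. The class name `BloomFilter`, the function `expected_false_positive_rate`, the parameter names, and the example URLs are hypothetical, and the k bit positions are derived by double hashing over one SHA-256 digest, an assumption made here in place of k independent hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Standard Bloom filter: an m-bit array probed by k hash positions."""

    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Assumption: double hashing. Two 64-bit values taken from a single
        # SHA-256 digest are combined to produce k positions in [0, m).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        # Set every bit selected by the k hash positions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen" (no false negatives).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def expected_false_positive_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Usual approximation p = (1 - e^(-k*n/m))^k for n inserted items."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# URL de-duplication as a crawler might use it (example URLs are made up).
seen = BloomFilter(m_bits=10_000, k_hashes=7)
for url in ("http://example.com/a", "http://example.com/b", "http://example.com/a"):
    if url not in seen:
        seen.add(url)  # first visit: record the URL; repeats are skipped
```

Under the usual approximation, a filter with these illustrative parameters has an expected false-positive rate of roughly 0.8% at 1,000 inserted URLs and roughly 40% at 3,000, which is the overload problem the abstract describes: once the number of URLs exceeds the designed capacity, the error rate climbs quickly.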