
Research And Application Of Bloom Filter In Duplicated Webpages Deletion

Posted on: 2014-01-18, Degree: Master, Type: Thesis
Country: China, Candidate: T Huang, Full Text: PDF
GTID: 2248330398452605, Subject: Computer technology
Abstract/Summary:
With the development of the Internet, the volume of information on the network is growing very fast. On the one hand, this abundance gives people more sources of information; on the other hand, it makes searching for information a heavy burden. According to statistics published by the China Internet Network Information Center (CNNIC) in 2012, the global number of web pages reached 86.6 billion in 2011 and increased to 122.7 billion in 2012. How to eliminate duplicate information on the Internet more effectively, so that people can find what they are looking for conveniently, has therefore become an important issue for the modern Internet. The Bloom filter, proposed in 1970, is an algorithm for duplicate elimination. It consists of a very long binary vector and a series of random hash functions. It is now used in many fields, and researchers at home and abroad have improved the algorithm. This thesis addresses both theory and application and, through experiments, seeks a better way to apply Bloom filter algorithms to the deletion of duplicated webpages. First, it introduces the concept and types of duplicate webpages, summarizes the reasons duplicate webpages are generated, and briefly introduces some related concepts. Second, it introduces the Bloom filter and its improved variants. Taking the improvements to the Bloom filter as a starting point, it selects the counting Bloom filter and the multi-dimensional Bloom filter, analyzes them theoretically, and describes the operational efficiency, advantages, and disadvantages of the three algorithms. Finally, it presents the experimental design: the three algorithms are compared on a constructed collection of a certain size, conclusions are drawn from an analysis of the results, and directions are pointed out for further improving Bloom filter algorithms in duplicated webpage deletion.
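To make the structure described in the abstract concrete, the following is a minimal Python sketch of a standard Bloom filter and a counting Bloom filter applied to URL de-duplication. It is not the thesis's implementation. The class names, the `dedupe` helper, the example URLs, and the use of SHA-256 with double hashing to simulate several hash functions are all assumptions made for illustration. The multi-dimensional Bloom filter is not shown.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: an m-bit vector plus k hash functions."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive k indexes from one SHA-256 digest by double
        # hashing (h1 + i * h2) instead of using k independent functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


class CountingBloomFilter(BloomFilter):
    """Counting variant: one counter per slot, so items can be removed."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.counts = [0] * m_bits

    def add(self, item):
        for p in self._positions(item):
            self.counts[p] += 1

    def remove(self, item):
        # Only decrement when the item appears to be present, so that
        # counters never drop below zero.
        if item in self:
            for p in self._positions(item):
                self.counts[p] -= 1

    def __contains__(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))


def dedupe(urls, seen):
    """Return the URLs not already recorded in the filter `seen`."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    crawl = ["http://example.com/a", "http://example.com/b",
             "http://example.com/a"]
    print(dedupe(crawl, BloomFilter(1 << 16, 4)))
```

The standard filter never reports a URL it has stored as unseen, but it may occasionally report a new URL as a duplicate, and that rate rises as the bit vector fills. The counting variant spends more memory per slot in exchange for supporting deletion, which matters when crawled pages are later dropped from the index.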
Keywords/Search Tags:Bloom filter, counting Bloom filter, multi-dimensional Bloom filters, Web crawlers, Duplicated Webpages Deletion