Font Size: a A A

Research And Application Of Data Deduplication Technology Based On Bloom Filter

Posted on:2017-01-17Degree:MasterType:Thesis
Country:ChinaCandidate:J Z HuangFull Text:PDF
GTID:2428330488479902Subject:Computer technology
Abstract/Summary:PDF Full Text Request
In the era of big data,will produce large amounts of data every day,and in these data in a high proportion of duplicate data.How to quickly retrieve massive duplicated data has received more and more attention.As a kind of lightweight data structure,Bloom Filter has been widely used in data retrieval,but Bloom Filter also has shortcomings.Because it is a kind of probabilistic data structure,it exist the occurrence of the error in the data retrieval.At the same time,Bloom Filter is only for static data sets,and can not be dynamically extended storage.In recent years,there are many improvements on the shortcomings of Bloom Filter.Split Bloom Filter and Dynamic Bloom Filter solve the problem of the elements of dynamic growth,and can effectively reduce the rate of misjudgment,but Split Bloom Filter and Dynamic Bloom Filter have increased the the query time cost of the elements.In this paper,according to the current Bloom Filter dynamic storage and query and other issues,this paper proposes a Multi-groups Balance Matrix Bloom Filter(M-BMBF),M-BMBF initiates a r×m matrix bloom filter according the size of dataset,introduces multiple located hash functions which can be used to divide the matrix Bloom Filter into multi-group to achieve balanced insertion and efficient query operations.In order to slow down the bits consumption rate in Bloom Filter when a new element is inserted,we propose a longest-bit match filling algorithm,which will select a Bloom Filter as the destination position for insertion from the candidate bloom filters according to the rule that fewest bits will be changed due to this insertion operation.HDFS is an open source distributed storage system,because of its own characteristics of the mechanism with a copy,so there is a lot of duplication of data in HDFS,and HDFS on these copies are not effective treatment,resulting in a great waste of storage space.Finally,based on the in-depth analysis and research of HDFS,this paper designs and implements a HDFS based M-BMBF repeat data deletion system,and the system is test.Experimental results show that compared with the typical Split Bloom Filter,M-BMBF can in the algorithm to maintain the consumption of time is based on the constant,effectively saving storage space and reduce the misjudgment rate.At the same time,the algorithm M-BMBF through rapid judgment duplicate data so as tosave data uploaded to the HDFS and the storage space of HDFS...
Keywords/Search Tags:Bloom Filter, Multi-groups Balance Matrix Bloom Filter, HDFS, Duplication of data
PDF Full Text Request
Related items