Research And Application Of Data Deduplication Technology Based On Bloom Filter

Posted on:2017-01-17

Degree:Master

Type:Thesis

Country:China

Candidate:J Z Huang

Full Text:PDF

GTID:2428330488479902

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

In the era of big data,will produce large amounts of data every day,and in these data in a high proportion of duplicate data.How to quickly retrieve massive duplicated data has received more and more attention.As a kind of lightweight data structure,Bloom Filter has been widely used in data retrieval,but Bloom Filter also has shortcomings.Because it is a kind of probabilistic data structure,it exist the occurrence of the error in the data retrieval.At the same time,Bloom Filter is only for static data sets,and can not be dynamically extended storage.In recent years,there are many improvements on the shortcomings of Bloom Filter.Split Bloom Filter and Dynamic Bloom Filter solve the problem of the elements of dynamic growth,and can effectively reduce the rate of misjudgment,but Split Bloom Filter and Dynamic Bloom Filter have increased the the query time cost of the elements.In this paper,according to the current Bloom Filter dynamic storage and query and other issues,this paper proposes a Multi-groups Balance Matrix Bloom Filter(M-BMBF),M-BMBF initiates a r�m matrix bloom filter according the size of dataset,introduces multiple located hash functions which can be used to divide the matrix Bloom Filter into multi-group to achieve balanced insertion and efficient query operations.In order to slow down the bits consumption rate in Bloom Filter when a new element is inserted,we propose a longest-bit match filling algorithm,which will select a Bloom Filter as the destination position for insertion from the candidate bloom filters according to the rule that fewest bits will be changed due to this insertion operation.HDFS is an open source distributed storage system,because of its own characteristics of the mechanism with a copy,so there is a lot of duplication of data in HDFS,and HDFS on these copies are not effective treatment,resulting in a great waste of storage space.Finally,based on the in-depth analysis and research of HDFS,this paper designs and implements a HDFS based M-BMBF repeat data deletion system,and the system is test.Experimental results show that compared with the typical Split Bloom Filter,M-BMBF can in the algorithm to maintain the consumption of time is based on the constant,effectively saving storage space and reduce the misjudgment rate.At the same time,the algorithm M-BMBF through rapid judgment duplicate data so as tosave data uploaded to the HDFS and the storage space of HDFS...

Keywords/Search Tags:

Bloom Filter, Multi-groups Balance Matrix Bloom Filter, HDFS, Duplication of data

PDF Full Text Request

Related items

1	Research And Application Of De-duplication Algorithm Based On Double Bloom Filter
2	Researches And Applications On Efficient Bloom Filter For Big Data
3	Multi-Bloom-Filter Query Algorithms And Their Applications
4	Privacy Preserved Bloom Filter And Key-value Based Bloom Filter
5	Research And Application Of Bloom Filter In Duplicated Webpages Deletion
6	Research And Application Of URL De-gravity Algorithm Based On Bloom Filter Algorithm
7	The Research On The Muti-keywords Search Technology Over P2P Network Based On Bloom Filter
8	The Design Of Bloom Filter Algorithm For Key-value Storage
9	Research On A Memory-Efficient Bloom Filter For DPI
10	The Design And Implementation Of A Scalable Counting Bloom Filter