Font Size: a A A

Researches And Applications On Efficient Bloom Filter For Big Data

Posted on:2016-08-25Degree:DoctorType:Dissertation
Country:ChinaCandidate:W LiFull Text:PDF
GTID:1368330491452455Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
As a new information revolution climax,big data emerges following cloud computing,Internet of things,mobile Internet.Big data has brought great challenges in data transfer and sharing,data storage,data retrieval and analysis.Information representation and query algorithms are key technologies in big data environment.When the data expands,developing efficient query algorithm with simple data structure has become one of the key technologies to enhance the effieicny of data management.As a succinct data structure supporting efficient information representing and query,Bloom filter(BF)has attracted lots of research interests and has been widely used in storage management and distributed systems.However,ever-increasing data in big data envieoments brings great challenges to apply Bloom Filter for supporting high speed and effiecent data respresent and manamgent in network filed and storage filed.Focusing on designing bloom filter in big data environment,this thesis makes following contributions.(1)Design of a hash Bloom filter for name lookup in named data networking.Quick name lookup is one of the key technologies to improve data packet forwarding rate in Named Data Networking(NDN).NDN uses a similar URL hierarchical naming scheme.Unlike fixed length IP addresses,NDN names have variable and unbounded length,which makes the name lookup more complex and difficult than IP addresses.To provide quick name lookup technique,this thesis designs a hash Bloom filter(HBF).HBF consists of g on-chip Counter Bloom filters(CBFs),g on-chip counters and g off-chip hash tables.Each hash table is associated with a CBF and a counter.To reduce the false positive rate introducing by unbalanced name insertion in to CBFs,we propose two-hash-choice algorithm which disperses the FIB/CS/PIT entries into g hash tables and CBFs.Moreover,HBF has a good feature of parallel processing of data packet forwarding because HBF adopts multiple hash tables and CBFs.Theoretical and simulated results demonstrate that HBF can achieve very efficient name lookup by well utilizing the on-chip memory through localization and filtering function of CBF.Therefore,the proposed HBF improves data packet forwarding rate and effectively avoiding flooding attacks.(2)Design of a data temperature perceptive Bloom filter for flash-based storage.Because of the inherent asymmetry of read and write speeds,out-of-place update,hot and cold data identification algorithm has become a key factor in improving the performance and reducing the flash storage costs.To identify the hot and cold data,this thesis designs a Data Temperature Perceptive Bloom Filter(DTPBF).DTPBF divides the period of data access record into n cycles.Based on round-robin fashion,DTPBF records data access frequency of each cycle through combination of a CBF and n BFs.The access frequency of each cycle is represented by a different temperature.The characteristic of the data access is calculated by utilizing n cycle temperatures.To guarantee single characteristic of the data,we utilize a bijective function to map n cycle temperatures into a single temperature value.Based on the single temperature value,the data with the same visiting regular pattern can be placed in same block in flash.Therefore,the proposed DTPBF can reduce the flash write magnification,improve flash write performance,enhance flash life and reliability.DTPBF also can identify the occasional hot data,which raises the flash buffer hit ratio,reduces the cost of data migration in hybrid storage mode and improves system efficiency.Theoretical and simulated results demonstrate that DTPBF can achieve low memory consumption and computational complexity.(3)Design of matrix-indexed Bloom filter(MIBF)for flash-based key-value storeIndexing structure is one of key technologies to improve the performance of system insertion and query.Directly using traditional B+ tree indexing in key-value storage based on flash easily leads to cascading updates from its parent node to the root node,which results in a serious decline in system performance.To solve this problem,this thesis designs a Matrix-Indexed Bloom filter(MIBF)for flash-based key-value store.MIBF is composed of multiple Bloom filter group(MBFG)represented by mxs bits matrix and additional Bloom filter(ABF).The core idea of MIBF is that flash page address of key-value pair is split into multiple bit groups,and MBFG is also divided into multiple groups correspondingly,each group bits are represented by the one group Bloom filter(BF)in MBFG,while the combined value of key and flash page address is inserted into ABF.When an element is queried according to the key,each BF group in MBFG generates a plurality of bit values and combines to generate the flash page address for the key-value pairs.In order to reduce pseudo flash page address generated by the MBFG due to false positive probability,pseudo flash page addresses are filtered out through the ABF,thereby achieving more accurate address location,reducing the number of flash access,and improving the system performance.Compared to previous work,the address location accuracy of MIBF query is improved and the access number of RAM and flash decreased significantly,which greatly improves the performance of solid state sisk(SSD)insertion and query.(4)Design of an accurate counter Bloom filter for Hadoop-Join algorithmWithout traditional data relational model,Key-value storage does not provide a similar SQL join language to user.Hadoop-Join algorithm is one of the key technologies to improve the efficiency of the comprehensive statistics analysis across large-scale distributed key-value data sets.However,hadoop is not very efficient to perform the join operation since it always processes all records in the datasets even in the cases that only small fraction of datasets are relevant for the join operation.An Accurate Counting Bloom Filter(ACBF)is proposed to filter out redundant intermediate records in join operation,which reduces communication costs and improves the performance of Hadoop-join algorithm.In order to take advantage of CBF free space to reduce the probability of false positive,we divide the counter vector of ACBF into the multilayer structure which organized by the offset index.The first level of ACBF is used to perform query operations for a set of membership,and the other layeres are used to represent insertion and deletion of the counter.In order to minimize false positive probability,the paper maximizes the size of the first layer to optimize the ACBF,while maintaining the normal function of CBF.Simulation results show that in the same situation of memory overhead,the false positives of ACBF is significantly lower than the CBF.ACBF is implemented the join operations in key-value database based on MapReduce model.The redundant data produced in the process of the shuffle is filtered out the through ACBF,which reduces network I/O,and further improves the performance of reduce-side join algorithm.(5)Design of accurate multi-dimension counting Bloom Filter aimed for multi-dimensional attribute dataPrecision multi-dimensional data representation and query algorithms become a challenging problem in dynamic big data enviroement.To conque this challenge,this thesis proposes an Accurate Multi-Dimension Counting Bloom Filter(AMD-CBF)query algorithm based on bijective function.When representing or querying an element,AMD-CBF needs two steps.The first step is to hash and map each attribute of the element to their corresponding Accurate Counting Bloom Flter(A-CBF).The second step is to transform all attributes of the element into a value by bijective function to represent the overall information of the element,then the value is hashed and mapped into a Combined Counting Bloom Fliter(C-CBF)for completing the representation and query confirmation of the elements overall.Both theoretical analysis and experiment demostrate that the AMD-CBF can support concise representation,approximate membership query and deletion of multi-Dimension data set and significantly lower false positive rate and improve query accuracy compared to similar researches.
Keywords/Search Tags:Big Data, Bloom Filter, Named Data Networking, Flash, Key-Value Store, Hadoop-Join, Multi-Dimension Data, Efficient Bloom Filter
PDF Full Text Request
Related items