The Researches And Applications On Bloom Filter Query Algorithm

Posted on:2008-02-15

Degree:Doctor

Type:Dissertation

Country:China

Candidate:K Xie

Full Text:PDF

GTID:1118360215979789

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

Information representation and query are two core problems of exchanging data objects for most application systems. The size of data sets encountered in databases, networks and many other applications has expanded dramatically in recent years. When data sets become very big or expensive to access, querying is a challenge for accessing and representing the set. It becomes increasingly important to handle massive data set using compact data structure with efficient representation and membership query operations. Thus, how to make membership queries by the aid of a compact structure becomes urgent with the data expanding in modern computer systems. This is a common issue for distributed systems that need to share information.A Bloom filter is an excellent data structure that can succinctly represent a data set in order to support membership queries, and filter out effectively any element that does not belong to the set. It is widely used in databases, networks and distributed systems and it has great potential for distributed applications where systems need to share information about available data. This thesis surveys the mathematics behind Bloom filters, some important variations and network-related applications of Bloom filters. The analysis of current research shows that although Bloom filters are now starting to receive significant attention from the algorithmic community and there have been a number of recent results, there may well be further improvements to be found. This thesis concentrates on the algorithm of Bloom filter. Some creative works in this thesis are mainly focused on the following aspects:1) This thesis presents a novel Bloom filter, called Basket Bloom filter (BBF). The BBF differentiates elements in a data set depending on their query invalidation cost, by clustering elements into different baskets. The total query invalidation cost function is defined. In order to minimize the total query invalidation cost, the genetic algorithm is employed to find the optimal number of hash functions for every basket. Simulation results show that, the BBF has 40% lower total query invalidation cost than the standard Bloom filters while the executing time is almost the same.2) To solve the scalability problem of Bloom filters, this thesis presents a new design of a scalable Bloom filter (SBF) for an expanding data set. The SBF keeps a low false positive rate by adding Bloom filter vectors with double length when necessary. The thesis proposes algorithms for element insertion and query operation of SBF by employing the H₃ class of universal hash functions. Theoretical and experimental results demonstrate that the new SBF provides false positive rate as low as 21.3% of the dynamic Bloom filter presented before and the querying CPU time increasing with logarithmic rather than linear. Therefore, the proposed SBF outperforms other current scalable Bloom filters significantly.3) Based on the analysis of the current Multi-dimension Bloom filter (MDBF), this thesis find that MDBF separates each attribute from whole element and any attribute's false positive will result in the whole element's false positive. This thesis proposes a new Multi-dimension Bloom filter, termed Combine Multi-dimension Bloom filter (CMDBF). CMDBF adds a Combine Bloom filter to represent the whole element comparing with the MDBF. When represent or query an element, CMDBF needs two steps, the first step is hashing or checking every attribute with the MDBF, the second step is further hashing or checking the whole element with Combine Bloom filter for further validation. Simulation results show that the CMDBF outperforms MDBF for large reduction of false positive rate.4) This thesis comprehensively discusses the relationship between algebraic operations on Bloom filters and algebraic operations on data sets. This thesis completely define algebraic operations including OR, AND, XOR, NOT, MINUS on Bloom filter, and study the membership query performance on Bloom filter and data set. Theoretical analyses and simulation results show that the Bloom filter ORed (ANDed) from the original Bloom filters can support element membership query on data set ORed (ANDed) from the original data sets.5) This thesis presents a new trace label based consistency maintenance algorithm in unstructured P2P systems, with the trace label is denoted by Bloom filter. It modifies the message datagram by attaching the address list of peers to which message have been sent. This can help to tell the duplicated message from the source peer by the aid of the attached address list in message datagram. Considering that the address list can become longer with the transmit time lapsing and the degree of P2P increasing, in this thesis we use Bloom filter to denote the address list. The Bloom filter can succinctly represent the address list and simplify the query actions in the list by OR operations. Simulation results show that the new trace label based consistency maintenance algorithm can reduce the number of duplicated message largely. Moreover, the higher the degree of P2P, the more reduction of the number of duplicated messages and bandwidth utilization.6) This thesis presents a new hybrid service discovery scheme for MANET which is based on Bloom filter. In order to represent service information compactly and search service efficiently, the counting Bloom filter is used to represent the published service index. The scheme adopts two-level hybrid service discovery architecture and two-level storages. This thesis gives the main processes of server discovery, such as service publish, service query, service withdrawal, service consistency maintenance among several service coordinators, service update when node moving. The Bloom filter distance is also defined. A variation evaluation algorithm for dynamic data sets based on the Bloom filter distance is presented, called Bloom filter distance evaluation algorithm (BFDE). Experimental and theoretical results show that the BFDE has 99.7% evaluation accuracy rate and the scheme has merits of compact storage and good searching performance.

Keywords/Search Tags:

Bloom filter, Data Structures, Set Membership Queries, Computer Networks, Distributed Computing, Distributed Information System, Replica consistency

PDF Full Text Request

Related items

1	Private Temporal Membership Test Queries With Probabilistic Data Structures
2	The Formal Modeling And Optimization For Data Partitioning And Replica Consistency In Distributed Storage Systems
3	Research On Key Technologies Of Distributed Storage In Cloud Computing
4	Content Synchronization In Distributed Systems
5	The Design And Implementation Of A Scalable Counting Bloom Filter
6	Network Routing Algorithm
7	The Research On Strategy Of Multiple Replica In Distributed Environment
8	The Research On WAN Distributed Storage Technologies Based On P2P Architecture
9	Research On The Replica Consistency Of P2P Distributed Storage Systems
10	The Simulation And Comparison Of Bloom Filter And Its Improved Algorithms In The Distributed Environment