
The Optimization And Application Of Big Data Query Based On Bloom Filter

Posted on: 2019-05-23  Degree: Master  Type: Thesis
Country: China  Candidate: W Rao  Full Text: PDF
GTID: 2348330542455563  Subject: Communication and Information System
Abstract/Summary:
With the development of information technology and data warehouse technology, huge amounts of data are generated continuously, and more and more scholars and companies are paying attention to the collection, cleaning, storage, and querying of these data. Searching for an element with a specific value amounts to determining whether the element is a member of a particular set. As data volumes grow, element representation and lookup methods must balance space complexity, time complexity, and accuracy in order to maintain system performance. Among the many lookup methods, the Bloom filter has attracted considerable attention because of its low memory consumption and high search efficiency. This thesis studies the optimization and application of the Bloom filter algorithm in a big data environment. The theory and application scenarios of the Bloom filter are illustrated through a sample analysis of customer behavior data.

First, a requirements analysis of the massive-data query framework is carried out. To obtain clear data lineage and reduce repeated development, the data warehouse is divided into layers. The characteristics and functions of each layer are analyzed, and a raw data access module, a raw data extraction module, and a paying-customer extraction module are designed and implemented according to the direction of data flow between the layers. After data cleaning and preprocessing, extracting the data of paying customers requires determining, within a large volume of data, whether a given account belongs to a paying user. A Bloom filter can search a large dataset effectively and quickly.

The thesis first uses Hive to solve the cascaded queries. The operation is concise, but parsing the SQL and executing the MapReduce jobs take a long time. Next, an in-memory database such as MongoDB is used to solve the same problem. With the premium accounts stored only in the default indexed field (_id), lookup has a time complexity of O(1). The disadvantages are that only limited database functionality is actually needed and that the load from concurrent (one-to-many) queries grows as the volume of data increases. The accounts can instead be read into memory in a suitable data structure by means of a distributed cache. Data access then becomes one-to-one, at the cost of higher memory usage. For small amounts of data, the performance of a HashSet is acceptable because of its convenience and speed, but as the volume of data increases, heap memory may overflow. This leads to the Bloom filter, a custom data structure. Its basic theory and false positive rate are analyzed, and the erroneous results (false positives) introduced by the Bloom filter can be eliminated. Theoretical analysis and experiments show that the Bloom filter's low space usage and high search efficiency make it well suited to this problem.
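The membership test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis's implementation: the class name `BloomFilter`, the helper `is_paying`, the use of SHA-256 with double hashing, and the exact-lookup confirmation step for removing false positives are all assumptions introduced here. The sizing uses the standard formulas m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions for n expected items and target false positive rate p.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: an m-bit array probed by k hash positions."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        n, p = expected_items, false_positive_rate
        # Standard sizing: m = -n ln p / (ln 2)^2, k = (m / n) ln 2
        self.num_bits = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing (an assumption of this sketch): derive k positions
        # from two 64-bit slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def is_paying(account: str, bloom: BloomFilter, exact_lookup) -> bool:
    """Screen with the Bloom filter, then confirm positives exactly.

    exact_lookup is a hypothetical callable (for example, a database
    query) that is only invoked for accounts the filter lets through.
    """
    return account in bloom and exact_lookup(account)
```

A Bloom filter never reports a stored element as absent, so only the "possibly present" answers need a second check. In this sketch the second check is an exact lookup, which is one common way to remove false positives while keeping most queries away from the slower store.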
Keywords/Search Tags:Query, Big data, MapReduce, Bloom Filter