
Optimize Parquet Through Bloom Filter

Posted on: 2019-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: J Che
Full Text: PDF
GTID: 2428330566995788
Subject: Software engineering
Abstract/Summary:
Optimization of big data systems almost always centers on the Hadoop ecosystem. Some efforts focus on SQL-level optimization, while others target the storage layer. Parquet is a storage format and thus belongs to the storage layer; thanks to the compression and read advantages of its columnar layout, it has been widely adopted across the big data ecosystem. A Bloom filter is a space-efficient data structure: it uses a bit array as a very compact representation of a set and can quickly determine whether an element belongs to that set. We can use this property to implement an index in Parquet. The structure is not perfect, however: when testing membership, an element that does not belong to the set may be mistakenly reported as belonging to it. This is called a false positive, and it makes Bloom filters unsuitable for "zero error" scenarios. On the other hand, a Bloom filter never reports an element of the set as absent. In other words, if the filter says an element is not in the set, the element is certainly not in it; if the filter says the element is in the set, the element may nevertheless be absent. We can minimize the false-positive rate by tuning the number of bits and the number of hash functions. By integrating a Bloom filter into Parquet's Index Page, we can quickly determine which blocks do not need to be scanned and filter them out. By adjusting the Bloom filter parameters to the data type and data size, we achieve the best results.
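To make the mechanism concrete, here is a minimal sketch of the data structure the abstract describes: a bit array with k hash functions, where membership tests can yield false positives but never false negatives, and where the bit count m and hash count k are chosen to hit a target false-positive rate (using the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2). This is an illustrative Python implementation, not the thesis's actual Parquet integration; all names here are our own.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: a bit array queried through k hash functions.

    Lookups may report a false positive, but an element that was added
    is always reported as present (no false negatives).
    """

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one digest via double hashing:
        # position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def optimal_params(n, p):
    """Bit count m and hash count k for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k
```

Sized for 1,000 items at a 1% target rate, `optimal_params(1000, 0.01)` yields roughly 9.6 Kbits and 7 hash functions; every inserted key passes the membership test, while absent keys are rejected about 99% of the time. This is exactly the trade-off the abstract exploits: a "definitely not present" answer lets a scan skip a block with certainty.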
Keywords/Search Tags: Parquet, Index, Bloom filter, False positive