
Optimize Parquet Through Bloom Filter

Posted on: 2019-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: J Che
Full Text: PDF
GTID: 2428330566995788
Subject: Software engineering
Abstract/Summary:
Optimization of big data systems almost always centers on the Hadoop ecosystem. Some efforts focus on SQL-level optimization, while others target the storage layer. Parquet is a storage format and thus belongs to the storage layer; thanks to the compression and read advantages of its columnar layout, it has been widely adopted across the big data ecosystem. A Bloom filter is a space-efficient data structure: it uses a bit array as a very compact representation of a set and can quickly determine whether an element belongs to that set. We can use this property to implement an index in Parquet. The structure is not perfect, however: when testing membership, an element that does not belong to the set may be mistakenly reported as belonging to it. This is called a false positive, and it makes Bloom filters unsuitable for "zero error" scenarios. On the other hand, a Bloom filter never reports an element of the set as absent. In other words, if the filter says an element is not in the set, the element is certainly not in it; if the filter says the element is in the set, the element may nevertheless be absent. We can minimize the false-positive rate by tuning the number of bits and the number of hash functions. By integrating a Bloom filter into Parquet's Index Page, we can quickly determine which blocks do not need to be scanned and filter them out. By adjusting the Bloom filter parameters to the data type and data size, we achieve the best results.
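To make the mechanism concrete, here is a minimal sketch of the data structure the abstract describes: a bit array with k hash functions, where membership tests can yield false positives but never false negatives, and where the bit count m and hash count k are chosen to hit a target false-positive rate (using the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2). This is an illustrative Python implementation, not the thesis's actual Parquet integration; all names here are our own.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: a bit array queried through k hash functions.

    Lookups may report a false positive, but an element that was added
    is always reported as present (no false negatives).
    """

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one digest via double hashing:
        # position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def optimal_params(n, p):
    """Bit count m and hash count k for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k
```

Sized for 1,000 items at a 1% target rate, `optimal_params(1000, 0.01)` yields roughly 9.6 Kbits and 7 hash functions; every inserted key passes the membership test, while absent keys are rejected about 99% of the time. This is exactly the trade-off the abstract exploits: a "definitely not present" answer lets a scan skip a block with certainty.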
Keywords/Search Tags: Parquet, Index, Bloom filter, False positive