Research On Hive Query Optimization Based On Parquet Format

Posted on:2018-10-13

Degree:Master

Type:Thesis

Country:China

Candidate:C X Liu

Full Text:PDF

GTID:2428330566495776

Subject:Software engineering

Abstract/Summary:

With the rapid development of the Internet industry and the explosive growth of global data volume, big data has become a research topic of great concern. Hive, one of the most commonly used data warehouse systems, implements a query processing engine on top of the Hadoop distributed architecture. However, Hive queries are still slow. Users therefore spend more time on data analysis, and analysis and processing may even fail to keep up with the rate of data growth, which seriously restricts the development of big data. Querying Parquet columnar storage files with Hive is now widespread, so improving Hive query speed on Parquet files is a particularly important issue.

By analyzing the Hive query processing flow, this research identifies two improvements that can greatly increase Hive query efficiency in certain cases. The first applies to batch queries. Further analysis of the query process shows that Hive executes many repeated instructions while reading data, which leads to problems such as an overly long code path and an excessive number of CPU instructions. To address this, the research proposes a vectorized query optimization that uses the vector as the basic unit of operation: a batch of data is loaded into the vector each time, and that batch is processed in a single pass. The second applies to queries over nested column data. The research analyzes Hive's existing field pruning and finds that it cannot filter out redundant fields inside a struct. On this basis, it proposes a finer-grained field pruning optimization that removes the unneeded fields of a struct before the query is executed. This avoids filtering those fields at runtime and improves query efficiency.

To validate the two optimizations, the research uses typical database test benchmarks and a typical case from common application scenarios. After generating the test data, it compares the execution time of the same SQL statement before and after optimization. The results verify that, for these particular kinds of queries, the two optimizations significantly improve the efficiency of Hive queries on Parquet files.
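The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation and does not use Hive or Parquet APIs. The function names, the `BATCH_SIZE` value, the dictionary-based schema, and the dotted-path notation are all assumptions introduced here for illustration.

```python
from typing import Dict, List, Sequence

BATCH_SIZE = 1024  # hypothetical batch size, chosen only for illustration


def sum_row_at_a_time(rows: Sequence[Dict[str, int]], column: str) -> int:
    """Row-oriented baseline: one field lookup and one add per row."""
    total = 0
    for row in rows:
        total += row[column]
    return total


def sum_vectorized(column_values: Sequence[int],
                   batch_size: int = BATCH_SIZE) -> int:
    """Batch-oriented variant: load a vector of values, process it in one pass."""
    total = 0
    for start in range(0, len(column_values), batch_size):
        # One "vector": a contiguous batch of values from a single column.
        batch = column_values[start:start + batch_size]
        # The whole batch is reduced in one tight loop.
        total += sum(batch)
    return total


def prune_struct(schema: dict, needed_paths: List[str]) -> dict:
    """Keep only the leaf fields named by dotted paths like 'user.address.city'.

    Assumes every path names a leaf field (not a whole struct) in `schema`.
    """
    pruned: dict = {}
    for path in needed_paths:
        parts = path.split(".")
        src, dst = schema, pruned
        # Walk down the nested structs, creating matching levels in the output.
        for part in parts[:-1]:
            src = src[part]
            dst = dst.setdefault(part, {})
        dst[parts[-1]] = src[parts[-1]]
    return pruned
```

For a query that reads only `id` and `user.address.city`, `prune_struct` would drop `user.name` and `user.address.zip` from the read schema before any data is scanned, which corresponds to pruning inside the struct ahead of execution rather than discarding those fields at runtime.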

Keywords/Search Tags:

Hive, Parquet, Query optimization, Vectorization, Nested column data

Related items

1	Column Storage Design And Query Optimization For Nested Records
2	Study And Implementation Of Storage Model And Query Vectorization In Column-oriented Database
3	Study On The Analysis And Optimization Of Column Storage Performance Based On Hive On Spark
4	Research And Implementation Of Query Optimization In Column-Oriented Compressed Data
5	Research On Query Optimization In Column-Oriented Data Warehouse
6	A Design And Implementation On Storage Structure Extension Of Big Data Warehouse Hive
7	Multi-Query Optimization Strategy Design And Implementation In Column-based OLAP System
8	Optimization And Implementation For DWMS Column-Store Query Execution Engine
9	Research And Implementation Of Key Techniques For Query Rewriting In Column-Store Data Warehouse
10	Design And Implementation Of Data Dictionaries In Column Storage DWMS