Research on Hive Query Optimization Based on the Parquet Format

Posted on: 2018-10-13
Degree: Master
Type: Thesis
Country: China
Candidate: C X Liu
Full Text: PDF
GTID: 2428330566495776
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet industry and the explosive growth of global data volume, big data has become a research topic of great concern. As the most commonly used data warehouse software, Hive implements a query processing engine on the Hadoop distributed system architecture. However, Hive's query speed is still insufficient. This makes users spend more time on data analysis, and can even leave data analysis and processing unable to keep up with the rate of data growth, seriously restricting the development of big data. Querying Parquet columnar storage files with Hive is now widely used, so improving the speed of Hive queries on Parquet files is a particularly important issue.

By researching and analyzing the Hive query processing flow, this research identifies improvements that can greatly increase the efficiency of Hive queries in certain cases. First, for batch queries, further analysis of the Hive query process shows that Hive executes many repeated instructions while reading data, which leads to problems such as an overly long code path and an excessive number of CPU instructions. To address this, the research proposes a vectorized query optimization. This optimization uses the vector as the basic unit of operation: a vector is loaded with a batch of data each time, and that batch is processed in a single pass. Second, for queries on nested column data, the research analyzes Hive's existing field pruning and finds that it cannot filter out redundant fields inside a structure. Based on this, the research proposes a more fine-grained field pruning optimization that filters out the unneeded fields inside a structure before the query is executed. This avoids filtering the unnecessary fields at runtime and improves query efficiency.

To validate the two optimizations, the research uses typical database test benchmarks and a typical case from common application scenarios. After generating the test data, the research compares the execution time of the same SQL statements before and after optimization. The results verify that, for these particular kinds of queries, the two optimizations can significantly improve the efficiency of Hive queries on Parquet files.
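The two ideas in the abstract can be illustrated with a small Python sketch. This is not the thesis's implementation (Hive itself is written in Java), and every name here (`BATCH_SIZE`, `sum_row_at_a_time`, `sum_vectorized`, `prune_schema`, the `price` column and the example schema) is a hypothetical chosen for illustration. The first pair of functions contrasts row-at-a-time processing with processing one vector of column values per step; the last function shows field pruning that descends into a nested structure and keeps only the leaf fields a query references.

```python
from typing import Dict, Iterable, List, Set

BATCH_SIZE = 1024  # hypothetical batch size, chosen only for illustration


def sum_row_at_a_time(rows: Iterable[Dict[str, int]], threshold: int) -> int:
    """Row-oriented baseline: the filter and the aggregate run once per row."""
    total = 0
    for row in rows:
        if row["price"] > threshold:
            total += row["price"]
    return total


def sum_vectorized(price_column: List[int], threshold: int) -> int:
    """Batch-oriented variant: each step handles one vector of column values."""
    total = 0
    for start in range(0, len(price_column), BATCH_SIZE):
        batch = price_column[start:start + BATCH_SIZE]  # load one vector
        selected = [v for v in batch if v > threshold]  # filter the whole batch
        total += sum(selected)                          # aggregate the whole batch
    return total


def prune_schema(schema: dict, needed_paths: Set[str], prefix: str = "") -> dict:
    """Keep only the fields whose dotted path the query references.

    Nested structures are represented as dicts; leaves are type names.
    """
    pruned = {}
    for name, child in schema.items():
        path = prefix + name
        if isinstance(child, dict):
            if path in needed_paths:
                pruned[name] = child  # the whole structure was requested
            else:
                sub = prune_schema(child, needed_paths, path + ".")
                if sub:  # keep the structure only if some inner field survives
                    pruned[name] = sub
        elif path in needed_paths:
            pruned[name] = child
    return pruned


if __name__ == "__main__":
    schema = {
        "id": "int",
        "user": {
            "name": "string",
            "address": {"city": "string", "zip": "string"},
        },
    }
    # A query that references only user.address.city needs one leaf column.
    print(prune_schema(schema, {"user.address.city"}))
```

In this sketch the pruned schema is computed before any data is read, which mirrors the abstract's point that unneeded fields inside a structure should be filtered out before execution rather than at runtime. The Python loop does not reproduce the CPU-level gains of vectorization; it only shows the change in the unit of work from a row to a batch.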
Keywords/Search Tags: Hive, Parquet, Query optimization, Vectorization, Nested column data