Design And Implementation Of Hive On Spark Dynamic Partition Pruning

Posted on:2019-10-15

Degree:Master

Type:Thesis

Country:China

Candidate:J G Tian

Full Text:PDF

GTID:2428330566495788

Subject:Software engineering

Abstract/Summary:

In a star schema, it is common to map one or more dimensions of a dimension table to partition columns. Consider a partitioned fact table joined to a dimension table that describes the partition information, where the Join condition is on the partition column and the Where condition filters the partition information. In this scenario, if unwanted partitions can be filtered out while the fact table is scanned, the amount of data loaded from the fact table is greatly reduced. To reduce SQL query time in this scenario, dynamic partition pruning optimization was introduced in Hive on Spark. Detailed technical research on the Hive on Spark execution process shows that dynamic partition pruning optimization in Hive on Spark is feasible. In addition, by analyzing the problems that exist in Hive on Spark without dynamic partition pruning, the thesis gives a set of optimization targets and an overview of the optimization process for the target scenario. Dynamic partition pruning in Hive on Spark involves multiple stages of Hive, including the logical plan optimization phase, the physical plan generation phase, and the Spark Task execution phase. During the logical plan optimization phase, a predicate is synthesized for each Filter Operator that matches the predefined regular expression, which enriches the expression relationships it contains. Before the physical plan is generated, each Filter Operator that satisfies the dynamic partition pruning condition is parsed, a branch containing the Operators that carry the partition pruning semantics is generated, and this branch is then split off from the original Operator Tree. During the physical plan generation phase, the independent Spark Task generated from this branch writes the information about the partitions to be read into an HDFS file. In the Spark Task execution phase, the Task that scans the partitioned table loads the partition information from the HDFS file and filters out unwanted partitions. The optimization effect of the technique was tested against the optimization metrics proposed for Hive on Spark dynamic partition pruning. The test results show that, in the Hive on Spark environment, dynamic partition pruning can greatly reduce the amount of data transmitted during the Join Shuffle phase in its application scenario. It also reduces query execution time and improves the query performance of Hive on Spark.
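
The pruning flow described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation and does not use Hive or Spark APIs: the table contents, the `month` partition column, and the helper names (`collect_partition_keys`, `scan_fact`, `join`) are all hypothetical, and an in-memory set stands in for the HDFS file that carries the partition information between Tasks.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Row = Dict[str, object]

# Hypothetical fact table, partitioned by "month" (partition value -> rows).
FACT_PARTITIONS: Dict[str, List[Row]] = {
    "2019-01": [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.0}],
    "2019-02": [{"order_id": 3, "amount": 7.5}],
    "2019-03": [{"order_id": 4, "amount": 2.0}],
}

# Hypothetical dimension table describing each partition value.
DIM_ROWS: List[Row] = [
    {"month": "2019-01", "season": "winter"},
    {"month": "2019-02", "season": "winter"},
    {"month": "2019-03", "season": "spring"},
]


def collect_partition_keys(dim_rows: List[Row],
                           predicate: Callable[[Row], bool]) -> Set[str]:
    """Pruning branch: apply the Where filter to the dimension table and
    collect the join-key values, i.e. the partitions worth reading.
    (The abstract describes writing this information to an HDFS file;
    a set stands in for that file here.)"""
    return {str(row["month"]) for row in dim_rows if predicate(row)}


def scan_fact(partitions: Dict[str, List[Row]],
              wanted: Optional[Set[str]] = None) -> Tuple[List[Row], int]:
    """Scan the fact table. If `wanted` is given, skip every partition
    whose key is not in it. Returns the rows and the partitions read."""
    rows: List[Row] = []
    partitions_read = 0
    for month, part_rows in partitions.items():
        if wanted is not None and month not in wanted:
            continue  # pruned: this partition is never loaded
        partitions_read += 1
        rows.extend({**r, "month": month} for r in part_rows)
    return rows, partitions_read


def join(fact_rows: List[Row], dim_rows: List[Row],
         predicate: Callable[[Row], bool]) -> List[Row]:
    """Join fact rows to the filtered dimension rows on the partition column."""
    dim_by_month = {row["month"]: row for row in dim_rows if predicate(row)}
    return [{**f, **dim_by_month[f["month"]]}
            for f in fact_rows if f["month"] in dim_by_month]


def is_spring(row: Row) -> bool:
    return row["season"] == "spring"


# Without pruning: every fact partition is scanned before the join.
full_rows, full_read = scan_fact(FACT_PARTITIONS)

# With pruning: the dimension filter runs first and restricts the scan.
keys = collect_partition_keys(DIM_ROWS, is_spring)
pruned_rows, pruned_read = scan_fact(FACT_PARTITIONS, keys)
```

In this toy setup both scans yield the same join result, but the pruned scan reads fewer partitions, so fewer fact rows would reach the join. That is the effect the abstract reports for the Join Shuffle phase.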

Keywords/Search Tags:

Star schema, Partition, Hive on Spark, Dynamic Partition Pruning, Improve query performance

Related items

1	Research And Application Of The Partition Technology In Real-Time Data Warehouse
2	Research And Application Of The Partition Technology In Real-time Data Warehouse
3	Research And Optimization Of Data Placement Method In Spark Partitioner
4	Research Of Task Partition And Resource Allocation Algorithms For Load Balance In Spark Computing Environment
5	Research On Optimization Methods Of Dynamic Equilibrium Partition Method For Data Skew In Spark Shuffle
6	Research On Spark Shuffle Process Performance Optimization
7	Research On Holistic Schema Matching Technology On Query Interface
8	Research Of Data Partition And Query Optimization Based On Database Cluster
9	Diversified Image Retrieval Based On Non-uniform Partition Matroid Constraints
10	Dynamic Data Partition In Distributed Information Networking Database Management System