
Design And Implementation Of Hive On Spark Dynamic Partition Pruning

Posted on: 2019-10-15 | Degree: Master | Type: Thesis
Country: China | Candidate: J G Tian
GTID: 2428330566495788 | Subject: Software engineering
Abstract/Summary:
In a star schema, it is common to map one or more dimensions of a dimension table to partition columns. Consider a partitioned fact table that joins a dimension table describing the partition information, where the join condition is on the partition column and the WHERE condition filters the partition information. If unwanted partitions can be filtered out while the fact table is scanned, the amount of fact-table data loaded is greatly reduced. To reduce SQL query time in this scenario, dynamic partition pruning was introduced into Hive on Spark. A detailed study of the Hive on Spark execution process shows that dynamic partition pruning is feasible there. An analysis of the problems in Hive on Spark before dynamic partition pruning yields a set of optimization targets and an overview of the optimization process for the target scenario.

Dynamic partition pruning in Hive on Spark involves multiple stages of Hive: the logical plan optimization phase, the physical plan generation phase, and the Spark Task execution phase. During logical plan optimization, predicates are synthesized for each Filter Operator that matches a predefined regular expression, which enriches the expression relationships the operator contains. Before the physical plan is generated, each Filter Operator that satisfies the dynamic partition pruning condition is parsed, a branch containing the Operator that carries the partition pruning semantics is generated, and this branch is then split off from the original Operator Tree. During physical plan generation, the branch becomes an independent Spark Task that writes the partition information to be read into an HDFS file. During Spark Task execution, the Task that scans the partitioned table loads the partition information from the HDFS file and filters out unwanted partitions.

The optimization was tested against the optimization metrics proposed for Hive on Spark dynamic partition pruning. The test results show that, in a Hive on Spark environment, dynamic partition pruning can greatly reduce the amount of data transmitted during the Join Shuffle phase in its application scenario. It also reduces query execution time and improves the query performance of Hive on Spark.
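The idea described in the abstract can be illustrated with a small in-memory sketch. It is not the thesis implementation and does not use Hive or Spark APIs: the tables `SALES_PARTITIONS` and `DIM_DATE`, the `season` column, and all function names are hypothetical, and a plain Python set stands in for the partition information that the pruning Task writes to an HDFS file.

```python
# Hypothetical fact table: one list of (month, amount) rows per partition,
# keyed by the value of the partition column.
SALES_PARTITIONS = {
    "2019-01": [("2019-01", 100), ("2019-01", 150)],
    "2019-02": [("2019-02", 200)],
    "2019-06": [("2019-06", 300), ("2019-06", 50), ("2019-06", 25)],
    "2019-07": [("2019-07", 400)],
}

# Hypothetical dimension table that describes each partition key.
DIM_DATE = [
    {"month": "2019-01", "season": "winter"},
    {"month": "2019-02", "season": "winter"},
    {"month": "2019-06", "season": "summer"},
    {"month": "2019-07", "season": "summer"},
]


def is_winter(row):
    """Stand-in for the WHERE condition on the dimension table."""
    return row["season"] == "winter"


def collect_partition_keys(dim_rows, predicate):
    """Pruning branch: apply the filter to the dimension table and return
    the surviving join-key values (stand-in for the HDFS file contents)."""
    return {row["month"] for row in dim_rows if predicate(row)}


def scan_fact(partitions, keep_keys=None):
    """Scan the fact table, skipping whole partitions whose key is not in
    keep_keys. With keep_keys=None every partition is read."""
    rows = []
    for key, part_rows in partitions.items():
        if keep_keys is not None and key not in keep_keys:
            continue  # partition pruned: its rows are never loaded
        rows.extend(part_rows)
    return rows


def join_and_sum(fact_rows, dim_rows, predicate):
    """Join on month, apply the dimension filter, and sum the amounts."""
    matching = {row["month"] for row in dim_rows if predicate(row)}
    return sum(amount for month, amount in fact_rows if month in matching)


# Without pruning: every partition is scanned and filtered only at the join.
unpruned_rows = scan_fact(SALES_PARTITIONS)

# With pruning: the partition keys are computed first, then used by the scan.
keys = collect_partition_keys(DIM_DATE, is_winter)
pruned_rows = scan_fact(SALES_PARTITIONS, keep_keys=keys)

total_unpruned = join_and_sum(unpruned_rows, DIM_DATE, is_winter)
total_pruned = join_and_sum(pruned_rows, DIM_DATE, is_winter)
print(len(unpruned_rows), len(pruned_rows), total_unpruned, total_pruned)
```

Both scans feed the join the rows it needs, so the query result is the same, but the pruned scan loads fewer fact rows. In a distributed setting this reduction is what lowers the data volume reaching the join shuffle.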
Keywords/Search Tags: Star schema, Partition, Hive on Spark, Dynamic Partition Pruning, Improve query performance