
Prediction Query Optimization Across Data Processing Engines

Posted on: 2023-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: B Han
Full Text: PDF
GTID: 2568307052996169
Subject: Electronic information
Abstract/Summary:
In recent years, the growth of large-scale data and advances in machine learning have driven an increasing diversification of analytical queries. A prediction query is a type of analytical query that loads data from a data storage system, preprocesses it, and finally produces predictions with a machine learning model. Prediction queries are widely used in business and decision analysis in industry. One research hotspot, of wide interest in both academia and industry, is prediction query optimization on distributed databases and big data processing systems.

A prediction query plan is executed across data processing engines. Specifically, the big data processing system evaluates the prediction query and delegates some of its operators to the database or, alternatively, pushes a subquery down to the database. When the pushed-down queries contain only simple operators, such as "project-filter-scan", the subqueries execute inefficiently and the network overhead is high. When the pushed-down queries contain complex operators, such as join and grouping, data transfer across data processing engines becomes the bottleneck of prediction query execution in large-scale data scenarios.

The goal of this thesis is to improve the efficiency of prediction query execution across data processing engines. For pushed-down queries containing only simple operators, the goal is to reduce the execution time of subqueries and the network transfer overhead between tasks across data processing engines. For pushed-down queries containing complex operators, the goal is to design a method that pushes down queries containing joins, groupings, and similar operators in parallel on large datasets, thereby avoiding time-consuming data transfer and out-of-memory errors. Based on these research goals, this thesis also designs a prototype system to verify the effectiveness of the proposed push-down optimizations.

The main contributions of this thesis are as follows.

1. For queries containing only simple operators, we propose a partition-based query rewriting method and a partition-based, location-aware task scheduling mechanism. To improve the efficiency of the pushed-down subqueries, we rewrite a query containing only simple operators into a set of queries, one per partition. This rewriting changes the data scanning process from a large number of random reads to sequential reads and reduces redundant data loading. In addition, to reduce network transmission across nodes, the task scheduling mechanism schedules the task of each partition-based rewritten query to an executor on the physical machine where the corresponding partition is located. The experiments show that prediction queries optimized with our method perform 2.9 to 4.4 times better than the prediction query plans generated by the query rewriting of systems such as Spark and Presto.

2. For queries containing complex operators, we design a query rewriting method based on semantic equivalence rules. This method rewrites a query containing Join and Group-by operators into a semantically equivalent set of queries. The method is applied during plan optimization, and the generated queries are pushed down to the database in parallel. The optimized plan not only accelerates computation by pushing down complex operators, but also transfers the execution results between the two systems in parallel, which provides good scalability. The experiments show that prediction query plans generated with our method perform 1.2 to 10 times better than the prediction query plans generated by Spark, Hive3, SuperSQL, and other systems.

3. We implement a prototype system, OBSpark, that applies these parallel extension methods to both queries containing only simple operators and queries containing complex operators. The prototype is built on the open source big data processing system Spark and the distributed database OceanBase. In addition, we describe the design ideas and the system implementation of the prototype.

In summary, we focus on two problems of prediction queries executed across data processing engines: inefficient operator execution and time-consuming data transfer between cross-engine tasks. We propose a partition-based query rewriting method, a task scheduling mechanism, and a semantic-equivalence query rewriting method. The experiments show that applying our optimization techniques significantly improves the performance of prediction queries compared with the query rewriting methods available in Spark, Presto, Hive3, and other systems.
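The partition-based rewriting and location-aware scheduling described in contribution 1 can be illustrated with a minimal Python sketch. It is not the OBSpark implementation, which the abstract does not detail. The `Partition` type, the function names, the host and executor names, the MySQL-style `PARTITION (...)` clause, and the round-robin fallback for partitions without a local executor are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical description of one table partition and the host storing it."""
    name: str
    host: str


def rewrite_by_partition(table, columns, predicate, partitions):
    """Rewrite one project-filter-scan query into one query per partition.

    Assumes the database accepts a MySQL-style explicit partition clause.
    """
    cols = ", ".join(columns)
    return {
        p.name: f"SELECT {cols} FROM {table} PARTITION ({p.name}) WHERE {predicate}"
        for p in partitions
    }


def schedule(partitions, executors_by_host):
    """Map each per-partition task to an executor, preferring the partition's host.

    If no executor runs on that host, fall back to round-robin over all
    executors (an assumed policy, not one stated in the abstract).
    """
    all_executors = sorted(e for es in executors_by_host.values() for e in es)
    if not all_executors:
        raise ValueError("no executors available")
    assignment = {}
    for i, p in enumerate(partitions):
        local = executors_by_host.get(p.host, [])
        pool = local if local else all_executors
        assignment[p.name] = pool[i % len(pool)]
    return assignment
```

In this sketch each rewritten query touches exactly one partition, so the database can scan it sequentially, and a task assigned to an executor on the partition's host reads its result without crossing nodes.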
Keywords/Search Tags: Prediction Query Optimization, Subquery Push-Down, Task Scheduling, Query Rewriting