Design and Implementation of Query Optimization Module for Distributed Column Database Based on Memory

Posted on: 2022-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: X Liao
Full Text: PDF
GTID: 2518306524493444
Subject: Master of Engineering
Abstract/Summary:
As new services continue to emerge, data of all kinds is growing explosively, and this data often carries considerable economic value. To exploit these large volumes of data and mine them for knowledge that supports decision making, demand for data analysis is growing in every industry. Traditional database query execution, however, is increasingly unable to meet this demand.

Traditional distributed databases mostly adopt a plan-first, execute-next approach to query optimization. This model depends entirely on statistics and on the optimizer. Even a small estimation error can be magnified hundreds of times in cost when the data volume is large, so this framework is fragile in the era of big data. An adaptive query system is a better solution. It is no longer limited to the traditional query framework: it splits a complete plan into multiple stages and uses the real data observed while executing earlier stages to optimize the plan for the later stages, so that plan generation and query execution are interleaved. This addresses the problems caused by optimizer errors and improves system robustness and performance.

This thesis implements a better query engine based on an adaptive query framework. The main work and innovations of this thesis are as follows:

1. Resolve data skew during query processing: statistics are obtained through random sampling in distributed scenarios and are used to re-partition the data and balance the workload of each node.

2. Improve the re-optimization process of conventional adaptive query systems: a new cost estimation method is introduced, the concept of the effective range of an optimal sub-plan is proposed, and the Newton iteration method is used to make solving the effective range of different physical operators more flexible. Computing and storing the effective range of each physical operator in advance speeds up re-optimization of the plan.

3. Mitigate poor join orders: because system performance drops as more data flows into a join, a filter generated from the dimension table is applied to the fact table before the join operation. This compensates for the inability of conventional adaptive query processing to modify the plan, lets even a poor join order come close to the optimal execution performance, and increases the robustness of the system.

4. Improve query task parallelism: traditional plan-based distributed databases pay more attention to parallelism between individual query tasks than to parallelism among the sub-tasks of a single query statement. They generally divide the plan into stages according to the logical order of execution, without distinguishing between main tasks and side tasks. In data-analysis scenarios, a single task takes a long time to execute. The system therefore divides the plan more sensibly and splits out side tasks to run concurrently, which improves the concurrency of a single query.
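The third contribution, filtering the fact table with a filter built from the dimension table before the join, can be illustrated with a small sketch. This is not the thesis's implementation: the row shapes, the function names (`build_key_filter`, `prefilter_fact`, `hash_join`) and the use of an exact Python set as the filter are all assumptions made for illustration. A real engine might use a Bloom filter or another compact structure instead; the abstract does not say which.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical row shapes, for illustration only:
#   dimension row: (join_key, attribute)
#   fact row:      (join_key, measure)
DimRow = Tuple[int, str]
FactRow = Tuple[int, float]


def build_key_filter(dim_rows: Iterable[DimRow]) -> Set[int]:
    """Collect the join keys present on the dimension side."""
    return {key for key, _ in dim_rows}


def prefilter_fact(fact_rows: Iterable[FactRow], key_filter: Set[int]) -> List[FactRow]:
    """Drop fact rows whose key cannot match any dimension row."""
    return [row for row in fact_rows if row[0] in key_filter]


def hash_join(dim_rows: Iterable[DimRow], fact_rows: Iterable[FactRow]) -> List[Tuple[int, str, float]]:
    """Plain hash join: build on the dimension rows, probe with the fact rows."""
    table: Dict[int, List[str]] = {}
    for key, attr in dim_rows:
        table.setdefault(key, []).append(attr)
    return [
        (key, attr, measure)
        for key, measure in fact_rows
        for attr in table.get(key, [])
    ]


# Toy data (assumed): keys 1 and 3 survive on the dimension side.
dim = [(1, "east"), (3, "west")]
fact = [(1, 10.0), (2, 5.0), (3, 7.5), (2, 1.0), (1, 2.5)]

key_filter = build_key_filter(dim)
reduced_fact = prefilter_fact(fact, key_filter)  # fewer rows reach the join
joined = hash_join(dim, reduced_fact)
```

The join result is the same with or without the pre-filter; the point of the sketch is only that fewer fact rows reach the join operator, which matters most when the join sits late in a poorly ordered plan.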
Keywords/Search Tags:data analysis, adaptive query, query optimization, distributed database