Research And Implementation Of Skew Join Optimization Technology On MyCat

Posted on:2017-01-21

Degree:Master

Type:Thesis

Country:China

Candidate:R Hu

Full Text:PDF

GTID:2308330503953784

Subject:Software engineering

Abstract/Summary:


In recent years, as the information era has advanced, distributed database technology has been applied more and more widely in enterprise production management. How to access the data in these distributed clusters efficiently is a major problem in the distributed field. Distributed proxy middleware offers a new way to solve the data-access problem in a distributed environment. It sits between the distributed database and the application and provides a unified interface abstraction through which the application accesses the underlying database cluster, so the application never has to face the cluster directly. Applications no longer need to spend much effort on data source switching, transaction processing, data aggregation, and other problems arising from data sharding, and can focus on business logic.

Join is a common and expensive operation in a distributed environment. Once distributed proxy middleware is introduced, avoiding data skew during the join, and executing the join efficiently with minimal transmission cost and an optimal load-balancing strategy, is important for the whole enterprise system. Current approaches mitigate the problems caused by data skew by using (partial) replication. However, contemporary replication-based approaches 1) introduce overhead, since they usually result in redundant data movement, 2) are sensitive to parameter tuning and to the degree of data skew, and 3) typically require that one side of the join is small.

Based on a survey and summary of related research, and targeting the scenario above, the main work of this thesis is as follows:

1. First, the thesis introduces the distributed proxy middleware MyCat and the problem of data skew in join operations in a distributed environment. It then reviews the history and current state of research on data skew in distributed environments and summarizes the characteristics and shortcomings of existing methods. Because the asymmetric fragment-and-replicate join cannot cope effectively with the heavy network transmission incurred when a large amount of data has to be replicated, the thesis proposes and implements a grouping fragment-replicate join strategy that aims to be robust to the size of both join sides. The thesis shows that this method effectively reduces network transmission overhead.

2. Because previous join algorithms are sensitive to parameter tuning and to the degree of data skew, the concept of a "virtual node" is put forward. A set of virtual nodes (far more numerous than the real nodes) is created, and the mapping between virtual nodes and real nodes is decided dynamically according to the real-time load and run-time parameters of the system. With this fine-grained load-balancing capability, the method effectively corrects the related defects of previous join algorithms.

3. Finally, the thesis uses the open-source distributed middleware MyCat as the experimental platform and a modified TPC-H data set as the test data. Comparison and analysis of the performance test results show that the proposed optimization technique effectively improves the performance of distributed joins, especially in the presence of data skew.
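
The virtual-node idea in item 2 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify how the mapping is computed, so the greedy "heaviest virtual node to least-loaded real node" rule, the use of row counts as the load measure, and all function names below are assumptions made for illustration. The sketch also does not cover the grouping fragment-replicate join; a single hot key still lands on one virtual node here.

```python
import zlib
from collections import Counter


def hash_to_virtual(key, num_virtual):
    """Map a join key to a virtual node with a deterministic hash (assumed scheme)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_virtual


def virtual_loads(keys, num_virtual):
    """Count how many rows fall on each virtual node (row count stands in for load)."""
    return Counter(hash_to_virtual(k, num_virtual) for k in keys)


def assign_virtual_nodes(virtual_load, num_real):
    """Greedy mapping: heaviest virtual node first, onto the least-loaded real node."""
    real_load = [0] * num_real
    mapping = {}
    # Sort by descending load; break ties by virtual node id for determinism.
    for v, load in sorted(virtual_load.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(num_real), key=lambda r: real_load[r])
        mapping[v] = target
        real_load[target] += load
    return mapping, real_load


def static_loads(virtual_load, num_real):
    """Baseline for comparison: fixed mapping 'virtual id modulo real node count'."""
    real_load = [0] * num_real
    for v, load in virtual_load.items():
        real_load[v % num_real] += load
    return real_load


# Hypothetical per-virtual-node row counts showing skew.
example = {0: 50, 1: 30, 2: 20, 3: 10, 4: 10}
mapping, balanced = assign_virtual_nodes(example, num_real=2)
print(static_loads(example, 2))  # [80, 40]
print(balanced)                  # [60, 60]
```

Because there are many more virtual nodes than real nodes, each one carries a small share of the data, which is what gives the remapping its fine granularity. In a running system the load figures would come from run-time statistics rather than a fixed dictionary.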

Keywords/Search Tags:

Query optimization, Distributed system, Join query, Data skew, Load balancing


Related items

1	Distributed Stream Join System Load Balance Strategy Studies
2	Research On Join Query Optimization Algorithm In Distributed Database
3	Research On Data Query Optimization Algorithm Of Distributed Database
4	Join Query Optimization For Large-Scale Data Based On New Computing Architecture
5	Distributed Database Multi-join Query Optimization Algorithm
6	Load Management Policies Of The Distributed Stream Processing System ARTs-SH
7	Join Processing And Optimizing On Large Clusters
8	Research On Query Optimization Algorithm Of Distributed Data Base
9	Research On Data Query Optimization In Distributed Database
10	Research On Online Aggregation Query Optimization Based On Spark