Query Optimization Based On MapReduce In The Cloud

Posted on:2016-05-08

Degree:Master

Type:Thesis

Country:China

Candidate:D Ding

Full Text:PDF

GTID:2308330503977053

Subject:Computer application technology

Abstract/Summary:


With the explosion of data over the past decade, big data has become a research hotspot in the information field. Many cloud-based distributed data processing platforms, such as Hadoop, Hive, and Pig, have been proposed to provide efficient and cost-effective solutions for big data query processing. However, most current research focuses on improving query processing performance from the view of the whole system without considering the features of the queries themselves, such as query similarity, which causes substantial redundant computation and reduces query execution efficiency. Moreover, almost all existing work either translates queries into MapReduce tasks according to traditional relational query optimization rules or optimizes simply by reducing the number of MapReduce tasks, while ignoring the execution features of the MapReduce framework, which limits any improvement in multi-query processing performance.

To solve these problems, this thesis proposes a multi-query optimization framework (Multi-Q) for MapReduce-oriented cloud environments, which not only exploits the dependence between multiple queries to reuse query results, but also uses optimal query sub-structures to reuse query structures. Specifically, the thesis covers the following two topics:

1) To realize query result reuse, a cluster-based partition algorithm called CPA is first used to logically partition the search range of the query workload. Then, a construction method for a Multi-query Reuse Dependence Graph (MRDG), based on the cluster-based partition results, is presented to depict the dependence between multiple queries. Finally, a Multi-Q processing algorithm based on the MRDG is put forward to achieve query result reuse and reduce redundant computation.

2) To achieve query structure reuse, an execution cost model based on MapReduce is first presented to evaluate the execution cost of the different phases of MapReduce, from which some optimal query sub-structures are proposed. Then, on the basis of the execution cost model, a query structure reuse optimization algorithm is designed, which achieves query structure reuse and reduces query execution cost by embedding the optimal query sub-structures into the execution plan.

Finally, these two query optimization methods are combined to improve overall query processing performance. We evaluate our approach by deploying the Multi-Q system based on Hadoop in a real cloud environment, SEU-Cloud, and conducting extensive experiments on the standard TPC-H dataset. The results verify that the Multi-Q system outperforms Hive while significantly reducing redundant query cost, thus boosting query processing performance.
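
The general idea behind query result reuse in topic 1) can be illustrated with a small sketch. This is not the thesis's CPA, MRDG, or Multi-Q algorithm, whose details are not given in the abstract. It assumes a much simpler setting: each query is an inclusive range filter on one integer attribute, and a query may reuse the result of another query whose range strictly contains its own. The names `build_reuse_graph` and `run_queries` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: a query is an inclusive [low, high] range filter on one attribute.
Range = Tuple[int, int]


def build_reuse_graph(queries: Dict[str, Range]) -> Dict[str, str]:
    """Map each query to the narrowest other query whose range strictly contains it.

    This is a simplified stand-in for a reuse dependence graph, not the MRDG itself.
    """
    parent: Dict[str, str] = {}
    for name, (lo, hi) in queries.items():
        best = None
        for other, (olo, ohi) in queries.items():
            if other == name or (olo, ohi) == (lo, hi):
                continue  # skip the query itself and identical ranges
            if olo <= lo and hi <= ohi:
                if best is None or (ohi - olo) < (queries[best][1] - queries[best][0]):
                    best = other
        if best is not None:
            parent[name] = best
    return parent


def run_queries(rows: List[int], queries: Dict[str, Range]):
    """Answer all queries, scanning the base data only when no result can be reused."""
    parent = build_reuse_graph(queries)
    results: Dict[str, List[int]] = {}
    base_scans = 0
    # Widest ranges first: a parent is strictly wider than its child,
    # so its result is always available before the child runs.
    order = sorted(queries, key=lambda q: queries[q][1] - queries[q][0], reverse=True)
    for name in order:
        lo, hi = queries[name]
        if name in parent:
            source = results[parent[name]]  # reuse an earlier, wider result
        else:
            source = rows                   # no reusable result: scan base data
            base_scans += 1
        results[name] = [r for r in source if lo <= r <= hi]
    return results, base_scans


if __name__ == "__main__":
    rows = list(range(20))
    queries = {"q1": (0, 15), "q2": (5, 10), "q3": (6, 8), "q4": (12, 19)}
    results, base_scans = run_queries(rows, queries)
    print(build_reuse_graph(queries))  # {'q2': 'q1', 'q3': 'q2'}
    print(results["q3"], base_scans)   # [6, 7, 8] 2
```

In this toy workload, four queries need only two scans of the base data, because q2 filters the result of q1 and q3 filters the result of q2. In a MapReduce setting the saving would come from avoiding repeated jobs over the full input, which is the redundancy the abstract describes.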

Keywords/Search Tags:

Big Data, Cloud Computing, Query Optimization, Query Result Reuse, Query Structures Reuse


Related items

1	Research And Improve On The Query Optimize Of MySQL
2	Research On The Collaboration Query Processing For Cloud Data
3	The Research On Query Optimization Technology Based On Big Data Platform
4	The Research Of Large-scale Spatial Nearest Neighbor Query In Cloud Environments
5	Research On Key Technologies Of Distributed Rank-aware Query Processing
6	Research On Integrity Verification Of Query Results In Cloud Computing
7	Research On Query Processing And Result Caching In Search Engine
8	Data Integrity Verification Technology Research And Implementation Of Range Query In Location-based Service
9	Design And Realization Of Optimized Query Strategy About Multi-Tenant Saas Based Application
10	Research On Fault-Tolerant Parallel Skyline Query Technology In Cloud Computing Environment