Research And Implementation Of Big Data Real-Time Query Optimization Based On Hypergraph And Bushy-Tree

Posted on: 2016-07-31
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Ma
GTID: 2308330470467755
Subject: Computer technology
Abstract/Summary:
Impala is a real-time query system for big data, launched by Cloudera Inc. It uses distributed-systems techniques to query big data efficiently, with Hadoop as its storage layer and Hive's metadata tables as its source of statistics. Although the latest version of Impala includes some query optimization techniques, it only supports query plans in left-deep tree form. Query optimization based on McCHyp (MinCutConservative Hypergraph), on the other hand, suffers from a large search space and long optimization time. We propose a query optimization method for Impala based on bushy trees and an Improved-McCHyp algorithm. We first analyze how Impala generates query plans and modify that procedure to support query plans in bushy-tree form. We then analyze the McCHyp algorithm, improve it with a pruning strategy that reduces query optimization time, and argue for the completeness and correctness of the improved algorithm. After analyzing several existing cost models, we propose a new cost model that considers disk I/O, network transfer, and the size of the right table, and we integrate the Improved-McCHyp algorithm into Impala 2.0 so that it generates better query plans from users' SQL statements. Finally, we evaluate our work on TPC-DS from four aspects: optimization algorithm, cost model, query performance, and scalability. Experimental results show that our method improves query efficiency and that query response time decreases further as more nodes are added.
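To illustrate why the plan shape matters, the following minimal sketch compares the best left-deep plan with the best bushy plan for a four-table chain join. It is not the thesis's Improved-McCHyp algorithm: it uses plain bottom-up dynamic programming over table subsets, and the table names, cardinalities, selectivities, cost weights, and the functions `rows`, `join_cost`, and `best_plan` are all hypothetical. The toy cost only mirrors the three factors named above (disk I/O, network transfer, and the size of the right table).

```python
from itertools import combinations

# Hypothetical statistics (not from the thesis): row counts and join selectivities.
CARD = {"A": 1000, "B": 1000, "C": 1000, "D": 1000}
SEL = {frozenset("AB"): 0.001, frozenset("BC"): 0.01, frozenset("CD"): 0.001}


def rows(tables):
    """Estimated output rows when all tables in the set are joined."""
    n = 1.0
    for t in tables:
        n *= CARD[t]
    for pair, s in SEL.items():
        if pair <= tables:
            n *= s
    return n


def connected(left, right):
    """True if some join predicate links the two sides (no cross products)."""
    return any(pair & left and pair & right for pair in SEL)


def join_cost(left, right, w_io=1.0, w_net=2.0, w_build=0.5):
    """Toy cost: read both inputs, ship the right input, build on the right input."""
    io = w_io * (rows(left) + rows(right))
    net = w_net * rows(right)
    build = w_build * rows(right)
    return io + net + build


def best_plan(tables, bushy=True):
    """Return (cost, plan) for the cheapest plan; plan is a nested (left, right) tuple."""
    tables = frozenset(tables)
    best = {frozenset([t]): (0.0, t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(sorted(tables), size)):
            for k in range(1, size):
                if not bushy and k != 1:
                    continue  # left-deep: the right input must be a single base table
                for right in map(frozenset, combinations(sorted(subset), k)):
                    left = subset - right
                    if left not in best or right not in best:
                        continue  # one side has no cross-product-free plan
                    if not connected(left, right):
                        continue
                    cost = best[left][0] + best[right][0] + join_cost(left, right)
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, (best[left][1], best[right][1]))
    return best[tables]


if __name__ == "__main__":
    print("left-deep:", best_plan("ABCD", bushy=False))
    print("bushy:    ", best_plan("ABCD", bushy=True))
```

With these made-up statistics, the bushy search can join A with B and C with D first and then combine the two small intermediate results, which a left-deep plan cannot express. Exhaustive subset enumeration like this grows exponentially with the number of tables, which is the kind of search-space cost that pruning is meant to reduce.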
Keywords/Search Tags: query optimization, Impala, cost model, bushy tree, query plan