Research And Optimization Of Data Real-time Query Analysis Platform Based On Kylin

Posted on:2019-03-03

Degree:Master

Type:Thesis

Country:China

Candidate:M K Li

Full Text:PDF

GTID:2348330542998164

Subject:Computer Science and Technology

Abstract/Summary:

The idea that data has value has taken root in the era of big data, and data query and analysis techniques are the foundation of data processing and mining. There is already a large body of research and open-source work on distributed query engines. However, as data volumes explode, Impala, which is based on columnar storage and MPP (massively parallel processing), can no longer respond within a second. At the same time, Apache Kylin, which is based on precomputation, faces the dimension explosion problem. In industry, companies provide data query service platforms for their various business lines. Such a platform deploys both Kylin and Impala on the same dataset to meet different business needs, but simply placing the two query engines side by side raises the learning cost for users, who often find it difficult to determine which query tool will provide faster and better service. In view of these academic and commercial shortcomings, this thesis studies and optimizes a real-time data query platform based on Kylin. The main contents and results are as follows.

First, to address dimension explosion and wasted cuboids in Kylin's Fast Cube Construction Model, a cube building model based on the query log is proposed and studied. This model gradually materializes the cuboids of the data cube according to the needs reflected in the query log and records the materialization state of the entire data cube, thereby reducing resource consumption during cube construction and shortening the time of the first cube build.

Second, a static materialization strategy cannot adapt to changes and shifts in the query distribution, so some materialized views become ineffective and query performance degrades. To address this, a self-adjusting materialized view algorithm based on the query log is proposed. The algorithm takes a fixed query time as a cycle and updates query statistics periodically. Furthermore, it adaptively adjusts the set of materialized views according to configured thresholds, which keeps query efficiency stable and reduces thrashing of the materialized views.

Third, to address the drawbacks of multiple query engines coexisting on earlier data query and analysis platforms, a query engine selection strategy is studied. The strategy routes SQL tasks to query engines automatically based on historical query statistics and dynamically selects the relatively better engine, so that data analysts can use Kylin's precomputed data without knowing about the data cube, which reduces the learning cost. Meanwhile, when Kylin performs poorly because a cuboid is missing, the strategy redirects the query task to Impala, which compensates for the shortcomings of the proposed query-log-based cube construction model.

Based on the above research, the Kylin source code is modified and optimized to implement the proposed cube building model. Furthermore, a real-time data query platform is built using Kylin, Impala and Hive. The system can build a cube with sparse cuboids according to user configuration, adjust the cube's materialization state using the historical query log, and provide users with a relatively better query engine and query analysis service.
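
The abstract does not give the algorithms themselves, so the following is only a minimal illustrative sketch of the two ideas it describes: threshold-based adjustment of the materialized cuboid set once per cycle, and routing a query to Kylin or Impala. It is not the thesis's implementation and does not use any Kylin API. The names `adjust_materialized`, `route`, `add_threshold` and `drop_threshold` are hypothetical. The sketch also assumes that each logged query is recorded as the exact set of dimensions it needs, and that routing depends only on whether a materialized cuboid covers those dimensions (the thesis routes on historical query statistics).

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set

# Assumption: a cuboid is identified by the set of dimensions it aggregates over.
Cuboid = FrozenSet[str]


def adjust_materialized(
    materialized: Set[Cuboid],
    cycle_queries: Iterable[Cuboid],
    add_threshold: int,
    drop_threshold: int,
) -> Set[Cuboid]:
    """Return the materialized set for the next cycle.

    A cuboid requested at least `add_threshold` times in this cycle is added;
    a materialized cuboid requested at most `drop_threshold` times is dropped.
    Keeping add_threshold > drop_threshold leaves a gap between the two
    conditions, so a cuboid near one threshold is not added and dropped
    on alternating cycles.
    """
    if add_threshold <= drop_threshold:
        raise ValueError("add_threshold must be greater than drop_threshold")
    hits = Counter(cycle_queries)  # per-cycle query statistics
    result = set(materialized)
    for cuboid, count in hits.items():
        if count >= add_threshold:
            result.add(cuboid)
    for cuboid in materialized:
        if hits[cuboid] <= drop_threshold:  # Counter yields 0 for unseen keys
            result.discard(cuboid)
    return result


def route(query_dims: Cuboid, materialized: Set[Cuboid]) -> str:
    """Pick an engine name for a query (simplified coverage-based rule)."""
    # A materialized cuboid can answer the query if it contains every
    # dimension the query needs; otherwise fall back to Impala.
    if any(query_dims <= cuboid for cuboid in materialized):
        return "kylin"
    return "impala"
```

In this sketch a query whose dimensions are not covered by any materialized cuboid goes to Impala, which mirrors the fallback role the abstract assigns to Impala when a cuboid is missing.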

Keywords/Search Tags:

kylin, data query analysis, cube building model, materialization strategy, query log

Related items

1	Research On The Efficient Materialization And Fast Query Of Condensed Data Cube
2	Research And Implementation Of Online Multiple Aggregation Query System Over The Big Data
3	Research On Aggregation For Complex Query Based On Data Cube
4	Research And Implementation Of Histogram Cube Compressed Storage And Incremental Updating And Query Under Cloud Environment
5	Research On Distributed OLAP Query Optimization Based On Hive
6	Research Of Distributed Data Cube Partial Materialization Method Based On Genetic Algorithm
7	Research On Key Methods Of Efficient Multi-dimensional Online Analytical Processing Query
8	Research And Implementation Of Construction And Query Techniques Of Histogram Data Cube Based On Hadoop
9	The Ad-Hoc Query System Based On Multi-Dimension Data Model
10	Design And Realization Of Optimized Query Strategy About Multi-Tenant Saas Based Application