
Massive Distributed In-memory Columnar Database Query Engine For On-line Analytical Processing

Posted on: 2018-09-20 | Degree: Master | Type: Thesis
Country: China | Candidate: J Wang | Full Text: PDF
GTID: 2348330512488075 | Subject: Engineering
Abstract/Summary:
On-line Analytical Processing (OLAP) is one of the most important workloads in the database field. With the rapid growth of daily data, extracting the potential value of vast amounts of data in a short time is a major challenge for database systems. Current database systems use two kinds of query engines. The first kind adopts a general computing framework such as Map/Reduce or Spark as its underlying computing module. Its main defect is that the computing model in general computing frameworks is synchronous, and the synchronization step can lead to long query delays. In addition, query engines based on Spark consume too much memory during processing. The second kind, found in systems such as Impala and HAWQ, uses a specially designed computing model that parallelizes traditional query models. This model partly overcomes the computing bottlenecks of traditional database systems, but these query engines still have the following drawbacks: 1) The computing model is a simple extension of the volcano model, which processes queries row by row. Row-oriented query models generate too much redundant intermediate data in OLAP workloads, which adds computing cost. 2) These query engines lack effective scheduling algorithms for improving query performance and resource utilization.

This thesis develops a novel In-Memory Scalable Distributed Columnar Query Engine (MSDCQE) for OLAP workloads, which aims to address the above drawbacks of current database query engines. MSDCQE has low query delay and low memory consumption when handling large volumes of data. The thesis makes three contributions. 1) It studies and analyzes up-to-date query engines in distributed database systems, classifying them into those built on general computing models and those built on specially designed computing models. It abstracts the distributed task scheduling problem as a workflow scheduling problem and compares the merits and defects of existing heuristic algorithms, including list scheduling and genetic algorithms. 2) It designs and implements an effective query engine based on columnar semantics that supports the complete process from parsing SQL to generating results. Furthermore, this query engine can handle large-scale datasets in real time with good fault tolerance. 3) It designs and implements a distributed task scheduling algorithm that adapts flexibly to diverse scenarios and produces efficient scheduling results.

The thesis offers three innovations. 1) It introduces a dataflow graph to express SQL using columnar semantics and executes tasks asynchronously, without the extra delay incurred by a synchronization step. 2) It designs and implements an effective columnar intermediate data structure with high serialization and deserialization performance and little internal memory fragmentation. 3) It develops a novel distributed task scheduling algorithm that combines deep reinforcement learning with traditional heuristic algorithms to handle the workflow scheduling problem. This algorithm offers flexibility and high accuracy.

Finally, the thesis evaluates MSDCQE with query tests and scheduling-algorithm performance tests. The query tests use the standard TPC-H database test sets, which are specially designed for OLAP. The scheduling tests compare the designed algorithm with other classical algorithms. The results show that MSDCQE outperforms Spark SQL by 10x on scan queries and by 5x on group queries. On join queries, MSDCQE matches the performance of Spark SQL. In terms of memory consumption, Spark SQL uses more than 8x as much memory as MSDCQE. The scheduling algorithm designed in this thesis generates a better scheduling plan than the other algorithms.
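The abstract treats distributed task scheduling as a workflow scheduling problem and names list scheduling among the classical heuristics used for comparison. As a rough illustration only, the Python sketch below shows a generic list-scheduling heuristic on a small query-like task graph. It is not the thesis's algorithm, which combines deep reinforcement learning with heuristics. The function name `list_schedule`, the task names, and the durations are hypothetical, and the sketch assumes identical workers, positive task durations, and no data-transfer cost between workers.

```python
from functools import lru_cache


def list_schedule(durations, deps, num_workers):
    """Generic list scheduling for a task DAG (illustrative sketch).

    durations: {task: positive run time}
    deps:      {task: [tasks that must finish first]}
    Returns ({task: (worker, start, finish)}, makespan).
    """
    # Build successor lists so each task's priority can be computed.
    succs = {t: [] for t in durations}
    for task, preds in deps.items():
        for p in preds:
            succs[p].append(task)

    @lru_cache(maxsize=None)
    def rank(task):
        # Upward rank: longest remaining path from this task to an exit task.
        return durations[task] + max((rank(s) for s in succs[task]), default=0)

    # Highest rank first. With positive durations a predecessor always
    # outranks its successors, so this order respects the dependencies.
    order = sorted(durations, key=lambda t: (-rank(t), t))

    worker_free = [0] * num_workers  # time at which each worker becomes idle
    finish = {}
    plan = {}
    for task in order:
        # A task may start only after all of its predecessors have finished.
        ready = max((finish[p] for p in deps.get(task, [])), default=0)
        # Pick the worker that allows the earliest start (lowest index on ties).
        w = min(range(num_workers), key=lambda i: max(worker_free[i], ready))
        start = max(worker_free[w], ready)
        finish[task] = start + durations[task]
        worker_free[w] = finish[task]
        plan[task] = (w, start, finish[task])
    return plan, max(finish.values())


# Hypothetical query plan: two scans feed a join, which feeds an aggregate.
durations = {"scan_a": 3, "scan_b": 2, "join": 4, "aggregate": 1}
deps = {"join": ["scan_a", "scan_b"], "aggregate": ["join"]}

plan, makespan = list_schedule(durations, deps, num_workers=2)
for task, (worker, start, end) in sorted(plan.items(), key=lambda kv: kv[1][1]):
    print(f"{task}: worker {worker}, start {start}, finish {end}")
print("makespan:", makespan)
```

With two workers the scans run in parallel, so the join can start as soon as the longer scan finishes. With one worker the same plan is serialized and the makespan equals the sum of all durations.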
Keywords/Search Tags:On-line Analytical Processing, Distributed In-memory Database, Distributed Computing Model, Task Scheduling Algorithms, Deep Reinforcement Learning