Research On Scan Scheduling In MPP Databases Over Distributed File Systems

Posted on:2018-04-16

Degree:Master

Type:Thesis

Country:China

Candidate:K Guo

Full Text:PDF

GTID:2428330515489732

Subject:Computer system architecture

Abstract/Summary:

MPP databases built on relational database technology have become a popular solution because of their thorough support for SQL standards and their massively parallel processing. However, because they rely on an underlying local file system, they cannot fully meet the requirements of massive data storage. For massive data storage and management, distributed file systems perform so well in reliability, availability, and scalability that they are increasingly being adopted. MPP databases over distributed file systems have therefore become a current research hotspot.

In relational databases, the scan operation is the bottom-level operation of almost every query, so query execution nearly always involves data scanning. The execution units scheduled by the query engine are responsible for scanning the data stored in the distributed file system. Before scan operations run, the execution units must be scheduled to determine which data blocks each will scan. During the scan, each execution unit reads its own data blocks according to the scheduling result. Because the distribution of data blocks across the distributed file system varies, an execution unit that is not on the same physical node as the block to be scanned triggers a network read, which adds network latency and reduces query execution efficiency.

This thesis focuses on how to carry out scan scheduling more effectively in MPP databases over distributed file systems. It takes HAWQ, a mainstream MPP system, as its research object and discusses the scan scheduling problem for queries in HAWQ. First, the scan scheduling procedure in HAWQ is analyzed and the problem is defined. A problem model is then built using a formal description, and the key factors of the problem are summarized. The current scheduling method is based on the continuity of data blocks in files, but it focuses only on maximizing reads of local data replicas and does not adequately consider node workload. A novel scheduling method based on node workload is therefore proposed, which takes both data locality and node workload into account. On the one hand, the data locality scheduling phase ensures that the scheduling results have good data locality. On the other hand, rescheduling the intermediate scheduling results based on node workload reduces the makespan of data scanning. Finally, in simulation experiments, data locality and makespan are used as metrics to evaluate the overall performance of the two methods. The results show that the method based on node workload performs better and produces query scheduling strategies with a smaller makespan in nine test cases. The average improvement is 25%, which achieves the expected goal.
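The two-phase idea described in the abstract (schedule for data locality first, then reschedule according to node workload to shorten the makespan) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm or HAWQ code. The function names, the block representation, and the `remote_penalty` cost factor for network reads are all hypothetical assumptions made for illustration.

```python
def scan_cost(size, node, replicas, remote_penalty):
    """Assumed cost model: a remote read costs remote_penalty times a local one."""
    return size if node in replicas else size * remote_penalty


def schedule_scans(blocks, nodes, remote_penalty=2.0):
    """Assign each block to a node.

    blocks: dict mapping block id -> (size, set of nodes holding a replica)
    nodes:  list of node names that run execution units
    Returns (assignment, load), where load[node] is the total scan cost.
    """
    load = {n: 0.0 for n in nodes}
    assignment = {}

    # Phase 1 (data locality): place each block, largest first, on the
    # least-loaded node that holds a replica; fall back to any node.
    for bid, (size, replicas) in sorted(blocks.items(), key=lambda kv: -kv[1][0]):
        candidates = [n for n in nodes if n in replicas] or list(nodes)
        target = min(candidates, key=lambda n: load[n])
        assignment[bid] = target
        load[target] += scan_cost(size, target, replicas, remote_penalty)

    # Phase 2 (workload rescheduling): move one block at a time off the
    # busiest node, but only if the receiving node still finishes earlier
    # than the busiest node did before the move.
    moved = True
    while moved:
        moved = False
        busiest = max(nodes, key=lambda n: load[n])
        others = [n for n in nodes if n != busiest]
        if not others:
            break
        for bid in [b for b, n in assignment.items() if n == busiest]:
            size, replicas = blocks[bid]
            target = min(
                others,
                key=lambda n: load[n] + scan_cost(size, n, replicas, remote_penalty),
            )
            new_cost = scan_cost(size, target, replicas, remote_penalty)
            if load[target] + new_cost < load[busiest]:
                load[busiest] -= scan_cost(size, busiest, replicas, remote_penalty)
                load[target] += new_cost
                assignment[bid] = target
                moved = True
                break  # recompute the busiest node after every move
    return assignment, load


def makespan(load):
    """Finish time of the slowest node under the assumed cost model."""
    return max(load.values())


def locality(assignment, blocks):
    """Fraction of blocks scanned on a node that holds a local replica."""
    local = sum(1 for b, n in assignment.items() if n in blocks[b][1])
    return local / len(assignment)
```

As a usage example under these assumptions: with four unit-size blocks whose only replica is on node `a`, two nodes `a` and `b`, and `remote_penalty=2.0`, the locality phase alone puts every block on `a` (makespan 4). The rescheduling phase moves one block to `b`, trading one remote read for a makespan of 3 and a locality of 0.75.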

Keywords/Search Tags:

Distributed File System, MPP Database, Query Scheduling, Workload Optimization

Related items

1	TETRIS: Intelligent database workload manager with multi-objective query optimization
2	Design And Implementation Of Query Optimizer For Massive Distributed Columnar Database
3	Research On Query Optimization Technology In Distributed Real Time Database
4	Research On Data Query Processing And Optimization In Distributed Database
5	Optimization Processing Techniques For Multi-tenant Query Workload In Data Market
6	Design And Implementation Of Query Optimization Module For Distributed Column Database Based On Memory
7	Distributed Joins And Optimization For BIG Table Based On Database OceanBase
8	Analysis And Optimization Of Data I/O Pass In The Distributed File System
9	The Research And Application Of Query Optimization In Distributed Database System
10	Optimizing Query Processing In Distributed In-Memory Databases