
Research On Scan Scheduling In MPP Databases Over Distributed File Systems

Posted on: 2018-04-16; Degree: Master; Type: Thesis
Country: China; Candidate: K Guo
GTID: 2428330515489732; Subject: Computer system architecture
Abstract/Summary:
MPP databases built on relational databases have become a popular solution because of their strong support for SQL standards and their massively parallel processing. However, because they rely on an underlying local file system, they cannot fully meet the requirements of massive data storage. For massive data storage and management, distributed file systems have performed so well in reliability, availability, and scalability that they are increasingly being adopted. MPP databases over distributed file systems have therefore become a current research hotspot.

In relational databases, the scan is the bottom-level operation of almost every query, so query execution nearly always involves scanning data. The execution units scheduled by the query engine are responsible for scanning the data stored in the distributed file system. Before the scan runs, the execution units must be scheduled to determine which data blocks each one scans; during the scan, each execution unit reads its own blocks according to the scheduling result. Because data blocks are distributed differently across the nodes of the distributed file system, an execution unit that is not on the same physical node as the block it scans triggers a network read, which adds network latency and reduces query execution efficiency.

This thesis focuses on how to perform scan scheduling more effectively in MPP databases over distributed file systems. It takes HAWQ, a mainstream MPP system, as its research object and studies the scan scheduling problem for queries in HAWQ. First, the scan scheduling procedure in HAWQ is analyzed and the problem is defined; a formal model of the problem is then built and its key factors are summarized. The current scheduling method is based on the continuity of data blocks in files; it focuses only on maximizing reads of local data replicas and does not take node workload sufficiently into account. A novel scheduling method based on node workload is therefore proposed, which considers both data locality and node workload. A data locality scheduling phase ensures that the result has good data locality, and a rescheduling phase then adjusts the intermediate result according to node workload in order to reduce the makespan of the scan. Finally, simulation experiments evaluate the overall performance of the two methods, using data locality and makespan as metrics. The results show that the workload-based method outperforms the existing one and produces query schedules with a smaller makespan in nine test cases. The average improvement is 25%, which achieves the expected goal.
Keywords/Search Tags:Distributed File System, MPP Database, Query Scheduling, Workload Optimization
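
The two-phase idea described in the abstract (a data locality phase followed by workload-based rescheduling) can be illustrated with a minimal sketch. This is not the thesis's algorithm or HAWQ's implementation: the function names, the greedy move rule, and the `remote_penalty` cost factor for network reads are all assumptions made for illustration. Blocks are given as `{block_id: (size, set_of_replica_nodes)}`.

```python
def scan_cost(size, node, replicas, remote_penalty):
    """Estimated scan time; a remote read is assumed slower by remote_penalty."""
    return size if node in replicas else size * remote_penalty


def locality_phase(blocks, nodes, remote_penalty=1.5):
    """Phase 1 (hypothetical): put each block on its least-loaded replica holder."""
    load = {n: 0.0 for n in nodes}
    assignment = {}
    for block_id, (size, replicas) in blocks.items():
        # Fall back to all nodes only if no listed node holds a replica.
        candidates = [n for n in nodes if n in replicas] or list(nodes)
        target = min(candidates, key=lambda n: load[n])
        assignment[block_id] = target
        load[target] += scan_cost(size, target, replicas, remote_penalty)
    return assignment, load


def workload_phase(blocks, assignment, load, remote_penalty=1.5):
    """Phase 2 (hypothetical): move blocks off the busiest node while that helps."""
    assignment, load = dict(assignment), dict(load)  # leave inputs untouched
    while True:
        src = max(load, key=load.get)  # node that currently defines the makespan
        best = None  # (resulting load of the pair, block, dst, src_cost, dst_cost)
        for block_id, node in assignment.items():
            if node != src:
                continue
            size, replicas = blocks[block_id]
            src_cost = scan_cost(size, src, replicas, remote_penalty)
            for dst in load:
                if dst == src:
                    continue
                dst_cost = scan_cost(size, dst, replicas, remote_penalty)
                pair_load = max(load[src] - src_cost, load[dst] + dst_cost)
                # Accept only moves that leave both nodes below the old maximum.
                if pair_load < load[src] and (best is None or pair_load < best[0]):
                    best = (pair_load, block_id, dst, src_cost, dst_cost)
        if best is None:
            return assignment, load
        _, block_id, dst, src_cost, dst_cost = best
        assignment[block_id] = dst
        load[src] -= src_cost
        load[dst] += dst_cost


def makespan(load):
    """Finish time of the slowest node."""
    return max(load.values())


def locality(blocks, assignment):
    """Fraction of blocks scanned on a node that holds a replica."""
    local = sum(1 for b, n in assignment.items() if n in blocks[b][1])
    return local / len(assignment)
```

With three equally sized blocks whose only replicas sit on one of two nodes, the first phase places all three on that node, and the second phase moves one block to the idle node at the cost of a remote read. The sketch makes the trade-off visible: makespan drops while the locality fraction falls below 1, which is why both metrics are needed to compare methods.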