Font Size: a A A

Research On Scheduling Technology For MapReduce Computing Model

Posted on:2016-07-06Degree:DoctorType:Dissertation
Country:ChinaCandidate:C J WangFull Text:PDF
GTID:1318330536467203Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
MapReduce is the most popular cloud computing framework at present,of which the optimization of task scheduling mechanism has been a research hotspot in cloud computing since itcan improve the efficiency of MapReduce significantly.In this paper,four important problems in this area are discussed: the data placement in MapReduce,Map task scheduling,the acceleration on imprecise applications and the parallel execution for serial programs.In MapReduce,the data placement is only in consideration of load balancing and data localization.However,the data placement can affect both the efficiency to process Map tasks and the data transmission time in the shuffle phase.To improve the efficiency to process Map tasks and optimize the shuffle process,we propose an optimal data placement approach(OPTAS)which can minimize the total time of Map and Shuffle phases in MapReduce.The main idea is:(1)To divide all kinds of data placement instances(DPIs)of a MapReduce job into multiple sub-spaces based ona finite number of discrete map time values inall of the data placement instances and then to obtain the optimal data placement instance by searching and comparingsub-space optimal DPIs,which can improve the search efficiency;(2)To calculatethe lower bound of the shuffle time of the sub-space optimal DPI and then construct the sub-space optimal DPI based on the lower bound and the map time of the sub-space,which can improve the speedto find out the sub-space optimal DPI;(3)To construct the sub-space optimal DPIs in accordance with the order of the map time valuesand thenthe optimal data placement instance just is the sub-space optimal DPI which has the first minimum value of the job time.The approach can significantly reduce the number of sub-space optimal DPIs to be constructed and improve the search efficiency for the optimal data placement instance.Experimental results show that,OPTAS can optimize the job time effectively and find out the optimal data placement instance quickly.The existing MapReduce scheduler assigns tasks to nodes when they are idle.There exist two problems in this intuitive approach:(1)The tail tasks in the queue have high possibility to be assigned to the slow nodes,which will lead to “long tail”;(2)The extensive scheduling operations involve high overhead.To solve these problems,we propose an optimal scheduling approach based on the node performance,which can effectively improve the efficiency of MapReduce and reduce scheduling overhead.The main idea is:(1)To obtain the theoretical values of the optimal scheduling schemeby the solution to the equation for the optimal scheduling scheme based on the known performance of nodes;(2)The task scheduling process is divided into two steps when there exist non-integer theoretical values: the integer part of each theoretical value isused to schedule the same number of tasksto the corresponding node directly and the decimal partsare collected and then the left tasks are scheduled according to the performance of nodes,which can also obtain the optimal scheduling scheme;(3)A self-assessment method for the performance of nodes is proposed,where the more map tasks are processed,the more accurate performance values can beobtained;(4)A batch scheduling approachis designed to reduce task scheduling overhead and the task schedulingoperations are performed only when some node completes all the assigned map tasks.Thus the new scheduling approach can obtain the optimal scheduling schemewiththe minimal scheduling overheadeven if the estimated performance values ofnodes are not very accurate.Experimental results show that this approach can effectively solve the two problems in the existing task scheduling approach of MapReduce.Imprecise applications are a kind of special applications processed in MapReduce and approximate results based on partial data can satisfy the requirement of imprecise applications.However,all the data must be processed in MapReduce.We propose the computing modelof MapCheckReduceand designrelated support environment for the efficient processing of imprecise applications in MapReduce.The main idea is:(1)Aconditional decision stepis addedinto MapCheckReduce and it decides whether to cancel the remaining Map tasks according to the completed Map tasks;(2)A kind of Check mechanism is designed to support the conditional decision in the Map process which not only receive and analyze the Map task messages,but also sends a “CancelRemaining Map Tasks” instruction to the task scheduler according to the analysis results;(3)The task scheduler in Map CheckReduceis designed to be able toschedule tasks and cancel all the remaining map tasks;(4)MapCheckReduce programming interfaces are proposed to support users define the termination conditions of the map process.A MapCheckReduce prototype hasbeenimplemented based on Hadoop MapReduce.Experimental results demonstrate the feasibility and effectiveness of MapCheckReduce.The new model permits the termination of the map process when some conditions are satisfied and then the Reduce process can complete the job based on the results of the completed map tasks.Since the size of dataset is increasing violently,it is in urgent need to execute the serial programs in parallel on multiple machines.However,traditional parallel computing techniques require the application staff and parallel programmers work together to refactor the serial programs.Therefore,we propose a parallel computing framework,named MEX,which can process data in parallel manner with serial executable programs by means of the distributed storage,task scheduling and the synchronization control mechanism.The main idea is:(1)The data is distributed on multiple nodes;(2)The original serial executable program is deployed in multiple nodes and can beexecuted in parallelon them and tasks are managed by way of centralized scheduling;(3)Amechanism for data localization is designed based on data prefetchingwhich supports efficient task processing and transparent access to remote data andthe dynamic balance of computing resources;(4)A fault-tolerant mechanism based on the process detection is designed to find and deal withtasks with some exceptionor failed tasks by monitoring their process statuses.MEX prototype has been implemented based on Hadoop MapReduce.Experimental results verify the feasibility and effectiveness of MEX.
Keywords/Search Tags:CloudComputing, Data Placement, Task Scheduling, Acceleration on Imprecise Applications, Parallel Execution Based on Serial Programs
PDF Full Text Request
Related items