
Research And Improvement Of The MapReduce Framework In Cloud Computing

Posted on: 2014-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: C J Wang
GTID: 2268330425472448
Subject: Computer Science and Technology
Abstract/Summary:
With the growth of the Internet, efficiently processing large amounts of data as information grows rapidly has become a serious problem. Traditional approaches to handling large data are costly. The Hadoop platform, which emerged from the cloud computing revolution, can cope with huge amounts of structured or unstructured data and process it in a massively parallel fashion. Hadoop offers a more convenient, cheap, fast and safe way to process massive data. Research on improving the stability of the framework and optimizing system performance is therefore increasingly significant.

Starting from the structure of the Hadoop framework, this thesis studies node load balancing and task scheduling optimization.

Firstly, we analysed how the intermediate results of Map tasks are mapped, pointed out the problem of data skew, and put forward two methods for balancing the data mapping: the fair load online model and the fair load offline model. The online model requires the distribution of keys to be analysed in advance, while the offline model requires the performance of the task slots to be measured. We then proposed a method for measuring node performance.

Secondly, we analysed the data locality issue, pointing out the importance of data locality and how it is affected by a heterogeneous environment. We studied task scheduling, analysed three existing scheduling algorithms, and proposed a node delay matching scheduling algorithm to improve the degree of data locality matching.

Finally, we built a distributed Hadoop environment and ran experiments on the cluster to compare the new method, with load balancing and node delay scheduling, against the original scheduling algorithm. The experiments showed that the improved method achieved better data locality and response time for most types of jobs.

In this thesis, we studied intermediate data mapping and task scheduling on the Hadoop platform and analysed the defects and performance bottlenecks of the framework. We proposed several improved algorithms, tested them on the cluster, and provide new ideas and methods for optimizing and upgrading the performance of the Hadoop platform.
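To make the data skew problem concrete, the following is a minimal sketch of one way intermediate keys could be spread evenly over reducers once the number of records per key is known. It is not the thesis's fair load model, whose details the abstract does not give. It is a generic greedy heuristic (heaviest key first, always onto the least-loaded reducer), and the names `balanced_partition`, `reducer_loads` and `key_counts` are hypothetical.

```python
import heapq


def balanced_partition(key_counts, num_reducers):
    """Map each key to a reducer, heaviest keys first, least-loaded reducer next.

    key_counts is an assumed input: a dict of intermediate key -> record count.
    """
    # Min-heap of (current load, reducer id); every reducer starts empty.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending count; break ties by key so the result is deterministic.
    for key, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return assignment


def reducer_loads(key_counts, assignment, num_reducers):
    """Total number of records each reducer receives under an assignment."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[assignment[key]] += count
    return loads


if __name__ == "__main__":
    counts = {"a": 50, "b": 30, "c": 20, "d": 10, "e": 10}
    plan = balanced_partition(counts, 2)
    print(plan, reducer_loads(counts, plan, 2))
```

A plain hash partitioner ignores key frequency, so a few heavy keys can overload one reducer. The sketch trades that simplicity for a prior pass over the key distribution, which is the kind of advance analysis the abstract mentions.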
Keywords/Search Tags:cloud computing, Hadoop, MapReduce, delay schedule