
Research And Optimization Of Parallel Computing Framework Based On MapReduce

Posted on: 2018-07-19  Degree: Master  Type: Thesis
Country: China  Candidate: B Hong  Full Text: PDF
GTID: 2348330512982982  Subject: Computer software and theory
Abstract/Summary:
The volume of data is growing rapidly, and enterprises struggle to cope with massive datasets. The need to process and analyze this data efficiently, and to reduce the pressure of concurrent access to it, has been the driving force behind Big Data solutions. MapReduce, one of the most important solutions in distributed computing, processes large-scale datasets through map and reduce functions written by users. YARN is a newer framework for resource management. MapReduce on YARN, the second version of MapReduce, abandons older concepts such as slots, the TaskTracker and the JobTracker, but retains the original computing flow and coordinates with YARN's ResourceManager, NodeManager and Container to complete jobs.

This thesis first introduces the Hadoop platform, including the HDFS architecture, the working principles of YARN and MapReduce, and Hadoop's job scheduling algorithms. Based on this study, it points out the shortcomings of MapReduce. During job execution, MapReduce reads and processes a large amount of data, so tasks interact with the disk frequently while fetching, computing on, or writing data, which can incur considerable unnecessary I/O cost. Pending data may be transmitted across nodes, so network conditions affect data transmission and, in turn, the efficiency of job execution. Improving the efficiency of task execution in a heterogeneous environment is therefore one of the key problems in MapReduce.

To address these issues, memory-level data caching is introduced into MapReduce. Drawing on established caching ideas and techniques, the thesis focuses on key issues such as the architecture of the cache system, the replica placement strategy for cached data, cache replacement, and task scheduling, in order to improve the processing speed of MapReduce. The main contributions are as follows:
1. For distributed storage in a heterogeneous environment and the distributed processing of MapReduce, we design a caching system for MapReduce and analyze its internal modules.
2. Considering the cost of data transmission over the network and the performance differences between nodes, the work adapts the replica placement strategy of HDFS and remedies its shortcomings by placing cached data on different nodes in a reasoned way, which supports local execution of tasks, data fault tolerance, and load balancing.
3. To achieve efficient task scheduling, the work designs a new cache-aware task scheduling strategy that orders task execution by priority.
4. To address data locality in MapReduce, the thesis proposes a new cache replacement strategy that considers factors such as the status of a file fragment and its access frequency.
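
The abstract does not give the details of the cache replacement strategy, so the following is only a minimal illustrative sketch of the general idea of choosing an eviction victim from a fragment's status and access frequency. Everything in it is an assumption made for illustration, not the thesis's actual design: the names CachedFragment and FrequencyStatusCache, the reading of "status" as an in-use flag, and the rule of evicting the least frequently accessed idle fragment.

```python
from dataclasses import dataclass


@dataclass
class CachedFragment:
    """Hypothetical record for one cached file fragment."""
    fragment_id: str
    size: int
    access_count: int = 0   # how often the fragment has been read from cache
    in_use: bool = False    # assumed "status": a running task is reading it


class FrequencyStatusCache:
    """Sketch: evict the least frequently accessed fragment that is not in use."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.fragments = {}  # fragment_id -> CachedFragment

    def get(self, fragment_id):
        """Return the cached fragment (or None) and count the access."""
        frag = self.fragments.get(fragment_id)
        if frag is not None:
            frag.access_count += 1
        return frag

    def put(self, fragment_id, size):
        """Cache a fragment, evicting idle fragments if needed.

        Returns False when enough space cannot be freed.
        """
        if fragment_id in self.fragments:
            return True
        # Space that could be freed by evicting every idle fragment.
        evictable = sum(f.size for f in self.fragments.values() if not f.in_use)
        if self.used - evictable + size > self.capacity:
            return False
        while self.used + size > self.capacity:
            victim = self._pick_victim()
            self.used -= victim.size
            del self.fragments[victim.fragment_id]
        self.fragments[fragment_id] = CachedFragment(fragment_id, size)
        self.used += size
        return True

    def _pick_victim(self):
        """Among idle fragments, choose the one with the fewest accesses."""
        idle = [f for f in self.fragments.values() if not f.in_use]
        return min(idle, key=lambda f: f.access_count)


cache = FrequencyStatusCache(capacity=3)
for name in ("a", "b", "c"):
    cache.put(name, 1)
cache.get("a")
cache.get("a")
cache.get("b")
cache.fragments["c"].in_use = True   # "c" is never accessed but is protected
inserted = cache.put("d", 1)         # evicts "b", the least-used idle fragment
```

In this sketch the in-use flag takes precedence over frequency, so a fragment that a task is currently reading is never evicted even if its access count is the lowest. A real strategy could weigh the two factors differently.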
Keywords/Search Tags:MapReduce, YARN, Cache, Replica placement, Task localization