Research And Optimization Of Parallel Computing Framework Based On MapReduce

Posted on:2018-07-19

Degree:Master

Type:Thesis

Country:China

Candidate:B Hong

Full Text:PDF

GTID:2348330512982982

Subject:Computer software and theory

Abstract/Summary:


The volume of data is growing rapidly, and enterprises struggle to handle it. The need to process and analyze such data efficiently, and to relieve the pressure of concurrent access to it, has become the driving force behind Big Data solutions. MapReduce, one of the most important solutions in distributed computing, processes large-scale datasets through user-written map and reduce functions. YARN is a newer framework for resource management. MapReduce on YARN, the second version of MapReduce, abandons older concepts such as slots, the TaskTracker and the JobTracker, but retains the original computing flow and cooperates with YARN's ResourceManager, NodeManager and Container to complete a job.

This thesis first introduces the Hadoop platform, including the HDFS architecture, the working principles of YARN and MapReduce, and the Hadoop job scheduling algorithms. Based on this study, it points out the shortcomings of MapReduce. During the execution of a job, MapReduce reads and processes a large amount of data, so tasks interact with the disk frequently while fetching, computing on or writing data, which can incur considerable unnecessary I/O cost. Pending data may also be transmitted across nodes, so network conditions affect data transmission and, in turn, the efficiency of the job. Improving the efficiency of task execution in a heterogeneous environment is a further key problem in MapReduce.

To address these issues, memory-level data caching is introduced into MapReduce. Drawing on established caching ideas and techniques, the thesis focuses on key issues such as the architecture of the cache system, the replica placement strategy for cached data, cache replacement and task scheduling, in order to improve the processing speed of MapReduce. The main contributions are as follows:

1. For distributed storage in a heterogeneous environment and the distributed processing of MapReduce, a caching system for MapReduce is designed and its internal modules are analyzed.
2. Considering the cost of data transmission over the network and the performance differences between nodes, the work adapts the replica placement strategy of HDFS and remedies its shortcomings by distributing cached data across nodes sensibly, so as to ensure local task execution, data fault tolerance and load balancing.
3. To achieve efficient task scheduling, the work designs a new cache-aware task scheduling strategy that orders the execution of tasks by their priorities.
4. To address data locality in MapReduce, the thesis proposes a new cache replacement strategy that considers factors such as the status of a file fragment and its access frequency.
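The abstract names fragment status and access frequency as inputs to cache replacement but does not give the actual policy. The following is only a minimal sketch of how such a policy could look, not the author's algorithm. The class names `CacheEntry` and `FragmentCache`, the `pending` flag (standing in for "status"), and the tie-breaking by recency are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class CacheEntry:
    data: bytes
    hits: int = 0          # access frequency of this fragment
    pending: bool = False  # assumed "status": a queued task still needs it
    last_used: int = 0     # logical clock value, used only to break ties


class FragmentCache:
    """Toy in-memory cache for file fragments (hypothetical, illustrative)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = {}        # fragment id -> CacheEntry
        self._clock = count(1)   # monotonically increasing logical time

    def get(self, fragment_id):
        """Return cached bytes, or None on a miss; a hit raises the frequency."""
        entry = self.entries.get(fragment_id)
        if entry is None:
            return None
        entry.hits += 1
        entry.last_used = next(self._clock)
        return entry.data

    def put(self, fragment_id, data, pending=False):
        """Insert or overwrite a fragment, evicting one entry if the cache is full."""
        if fragment_id not in self.entries and len(self.entries) >= self.capacity:
            self._evict()
        self.entries[fragment_id] = CacheEntry(
            data=data, pending=pending, last_used=next(self._clock)
        )

    def _evict(self):
        """Drop the entry that is not pending, least accessed, then oldest."""
        victim = min(
            self.entries,
            key=lambda k: (
                self.entries[k].pending,    # False sorts first: evict idle fragments
                self.entries[k].hits,       # then the lowest access frequency
                self.entries[k].last_used,  # then the least recently used
            ),
        )
        del self.entries[victim]
        return victim
```

In this sketch a fragment that a queued task still needs is kept even when it has fewer hits than an idle fragment, which is one plausible way to combine status with frequency; a real policy would also have to weigh fragment size and node memory.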

Keywords/Search Tags:

MapReduce, YARN, Cache, Replica placement, Task localization


Related items

1	Replica Placement Strategy Research In MapReduce Cluster
2	Research On Key Technologies Of Performance Tuning Of Jobs In Distributed Data Processing System
3	Research On Cache Placement And Task Scheduling Methods Based On Comprehensive Utility In Edge Environment
4	Research On MapReduce Program Based On YARN
5	Research On Data Placement Technology In MapReduce-styled Data Processing Platform
6	Design Of MapReduce Task Scheduling Algorithms In Heterogeneous Hadoop Cluster
7	Research And Experiment About The Data Replica Placement Algorithm In Cloud Storage System
8	Research Of Replica Location And Replica Placement For Massive Data
9	MapReduce Job Oriented Collaborative Optimization On Cloud Data Center Network Resource
10	Research On YARN Heterogeneous Cluster Management Method Based On FPGA Acceleration