
Research On The Performance And Optimization Of MapReduce Model In Hadoop Platform

Posted on: 2015-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Yao
Full Text: PDF
GTID: 2308330473451883
Subject: Communication and Information System
Abstract/Summary:
With the rapidly growing data volume and the burden of data access in the big data era, the demand for computational performance is soaring. As an effective solution, cloud computing has developed rapidly since it was proposed; its nearly unlimited storage capacity and computing power have led information technology into a new arena. Hadoop, as a current mainstream cloud computing platform, has also gained wide acknowledgement and broad application.

Hadoop is a high-performance data processing platform with high availability, good scalability, and excellent extensibility, and it also has the advantages of low cost and open source. Hadoop has two core components: HDFS (Hadoop Distributed File System) and MapReduce. HDFS is a distributed file system supporting very large files, streaming access, and high throughput; MapReduce is a fast parallel programming model that makes the details of parallel execution transparent and exposes only simple interfaces to users.

This dissertation first introduces the background of the Hadoop platform, including its origin and technical development as well as its applications and prospects. It then studies the crucial technologies of the Hadoop platform: HDFS, MapReduce, and the scheduler. Building on this groundwork, the dissertation identifies three optimization levels, i.e., the program level, the parameter level, and the system level; the many tuning options at the system and parameter levels are elaborated in Chapter Three.

Hadoop's resource management binds memory and CPU resources together and divides them into Map slots and Reduce slots according to task type. This mechanism is simple to implement, but it suffers from resource hoarding and low utilization. Chapter Four defines two resource models, memSlot and cpuSlot, to unbind these resources; the proposed models assign resources according to the actual requirements of Map and Reduce tasks. In a cluster of seven PCs processing 21 GB of log data, the proposed scheme improves memory utilization by 3.5% and CPU utilization by 4.3%, showing its effectiveness in alleviating the hoarding problem.

Because MapReduce performs a large amount of sorting, much of it recursive, the performance cost is high. To address this problem, Chapter Five clarifies the workflow of the Shuffle phase and proposes a more efficient counting sort to replace quick sort. Chapter Five also branches the Shuffle phase according to whether a Combiner is used: one branch reduces the cost by eliminating the quick sort in the Spill phase and the merge sort in the Combine phase, while the other improves efficiency by running the Combiner earlier. When processing 21 GB of log data on a cluster of seven PCs, each branch yields an improvement of roughly half an hour.
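
The slot-unbinding idea described above can be illustrated with a small sketch. The following Java snippet is a hypothetical illustration only (names such as UnboundSlotScheduler and TaskRequest, and the memSlot-style memory pool and cpuSlot-style CPU pool it models, are assumptions, not the dissertation's actual code or any Hadoop API): a task is admitted only when both pools can cover its stated requirement, instead of occupying a fixed, pre-bound Map or Reduce slot.

// Sketch: memory and CPU managed as separate pools ("memSlot"/"cpuSlot")
// instead of bound Map/Reduce slots. Hypothetical names, not Hadoop code.
public class UnboundSlotScheduler {

    /** Hypothetical per-task resource request (memory in MB, CPU cores). */
    public static final class TaskRequest {
        final int memMb;
        final int cpuCores;
        TaskRequest(int memMb, int cpuCores) {
            this.memMb = memMb;
            this.cpuCores = cpuCores;
        }
    }

    private int freeMemMb;    // remaining memory-pool capacity on the node
    private int freeCpuCores; // remaining CPU-pool capacity on the node

    public UnboundSlotScheduler(int totalMemMb, int totalCpuCores) {
        this.freeMemMb = totalMemMb;
        this.freeCpuCores = totalCpuCores;
    }

    /** Admit the task only if both resource pools can satisfy it. */
    public synchronized boolean tryAllocate(TaskRequest req) {
        if (req.memMb <= freeMemMb && req.cpuCores <= freeCpuCores) {
            freeMemMb -= req.memMb;
            freeCpuCores -= req.cpuCores;
            return true;
        }
        return false; // a bound Map/Reduce slot would sit idle instead
    }

    /** Return the task's resources to the pools when it finishes. */
    public synchronized void release(TaskRequest req) {
        freeMemMb += req.memMb;
        freeCpuCores += req.cpuCores;
    }
}
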
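The counting-sort replacement in the Spill step can likewise be sketched. The snippet below is a standalone, assumed illustration (not the dissertation's implementation and not Hadoop Shuffle code): it orders record indices by a bounded integer key, such as a reduce-partition number, in O(n + k) time and stably, which is the property that makes counting sort attractive when keys fall in a small known range.

import java.util.Arrays;

// Sketch: counting sort over a bounded integer key range, the kind of key
// ordering that could stand in for quick sort during the spill step.
public class CountSortDemo {

    /**
     * Returns record indices ordered by an integer key in [0, maxKey],
     * stably and in O(n + maxKey) time.
     */
    public static int[] countSort(int[] keys, int maxKey) {
        int[] count = new int[maxKey + 1];
        for (int k : keys) {
            count[k]++;                      // histogram of key occurrences
        }
        for (int i = 1; i <= maxKey; i++) {
            count[i] += count[i - 1];        // prefix sums give end positions
        }
        int[] order = new int[keys.length];
        for (int i = keys.length - 1; i >= 0; i--) {
            order[--count[keys[i]]] = i;     // place indices stably, right to left
        }
        return order;                        // record indices in key order
    }

    public static void main(String[] args) {
        int[] partitionIds = {2, 0, 1, 2, 0};        // e.g., reduce-partition numbers
        int[] order = countSort(partitionIds, 2);
        System.out.println(Arrays.toString(order));  // prints [1, 4, 2, 0, 3]
    }
}
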
Keywords/Search Tags: Hadoop, MapReduce, Resource Management, Task Execution