
Performance Tuning of MapReduce Programs

Posted on: 2016-12-23    Degree: Ph.D    Type: Dissertation
University: North Carolina State University    Candidate: K.C., Kamal    Full Text: PDF
GTID: 1478390017477114    Subject: Computer Science
Abstract/Summary:
This dissertation addresses performance tuning of MapReduce programs. The MapReduce framework simplifies the processing of large datasets across a large number of machines: a user only needs to implement map and reduce functions to create a scalable distributed application, and the framework takes care of all other operations, such as creating tasks for each function, parallelizing the tasks, distributing data, and handling machine failures. MapReduce programs run on both Hadoop and YARN. Hadoop is a computing framework built on the design of the original MapReduce framework, while YARN is a generalized, container-oriented, large-scale data processing framework that runs MapReduce applications.

In this dissertation, we first characterize MapReduce programs based on the CPU and IO usage of a map task. Our findings show that, based on the similarity of application performance under different task parallelism settings, MapReduce applications can be grouped into three categories, IO-intensive, Balanced, and CPU-intensive, using cutoffs on CPU usage. Applications in each group exhibit similar map completion time characteristics.

Second, we develop a static tuning method for setting the task parallelism of MapReduce programs. We evaluate thirteen MapReduce applications drawn from all three categories on two clusters with different architectures and obtain the same finding: IO-intensive applications perform best with a normalized map task parallelism below 1, Balanced applications with a value of 1 or above, and CPU-intensive applications with a value close to 1. Normalized map task parallelism is the ratio of the number of map tasks to the number of CPU contexts present in the system (sketched after the abstract). This static method, which chooses task parallelism values based on an application's category, is more efficient than exhaustive profiling or a default setting.

Third, we develop a feedback-controller-based dynamic tuning approach that adjusts task parallelism during the runtime execution of MapReduce applications. For this, we measure map completion time against metric values and identify three instantaneously measurable operating system metrics, user CPU, blocked processes, and context switches, as runtime indicators of application performance. Using these metrics and a combined value called the score, we develop a PID controller for Hadoop and Waterlevel, PD, and PD+pruning controllers for YARN (a generic controller loop is sketched after the abstract). Our findings show that dynamically changing task parallelism with feedback controllers achieves performance close to that of the optimal task parallelism values and better than default and best-practice settings, with the added benefit of not requiring application profiling.

Fourth, we study the performance effects of data scaling and configuration parameters on MapReduce programs running on a large cluster of 540 nodes. We find that IO-intensive applications do not scale as data size increases, and that configuration parameters which change task parallelism and overlap affect application performance. This study also uncovers issues that arise at large scale, such as the production of huge logs, the need to change the allocation strategy for the tasks that coordinate application execution, and the need to use data types that can handle calculations on large numbers without causing overflows.
Fifth, we develop a log compression technique that compresses log messages online, during the execution of a MapReduce application. It does so by encoding each log message with the identifier of the log message template from which it was generated, where templates are derived from Hadoop/YARN's source code (sketched after the abstract). Our findings show that this technique reduces the log size to one-third of the raw uncompressed size with a 3% overhead on application completion time.
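As a concrete reading of the static tuning rule described in the second contribution, the following Python sketch computes normalized map task parallelism and picks a map-task count from an application's category. The CPU-usage categories are taken from the abstract, but the exact target ratios per category are illustrative assumptions, not values from the dissertation.

```python
import os

def normalized_map_parallelism(num_map_tasks: int, cpu_contexts: int) -> float:
    """Ratio of map tasks to CPU contexts present in the system."""
    return num_map_tasks / cpu_contexts

def suggest_map_tasks(category: str, cpu_contexts: int) -> int:
    """Pick a map-task count from the application category.

    The target ratios below (0.75, 1.25, 1.0) are placeholders chosen only to
    illustrate the 'below 1 / at or above 1 / close to 1' rule; the dissertation
    derives its settings from profiling thirteen applications.
    """
    target_ratio = {"IO-intensive": 0.75, "Balanced": 1.25, "CPU-intensive": 1.0}[category]
    return max(1, round(target_ratio * cpu_contexts))

if __name__ == "__main__":
    contexts = os.cpu_count() or 1
    for category in ("IO-intensive", "Balanced", "CPU-intensive"):
        tasks = suggest_map_tasks(category, contexts)
        print(category, tasks, normalized_map_parallelism(tasks, contexts))
```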
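The dynamic tuning approach in the third contribution drives feedback controllers from a combined score of user CPU, blocked processes, and context switches. The sketch below is a generic textbook PID loop that nudges a task-parallelism value toward a score set-point; the gains, the set-point, and the equal-weight way the score is combined here are assumptions for illustration, not the dissertation's controller design.

```python
from dataclasses import dataclass

@dataclass
class PIDController:
    """Textbook PID controller; gains are illustrative, not tuned values."""
    kp: float = 0.5
    ki: float = 0.1
    kd: float = 0.05
    _integral: float = 0.0
    _prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float = 1.0) -> float:
        error = setpoint - measured
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative

def score(user_cpu: float, blocked_procs: float, ctx_switches: float) -> float:
    """Hypothetical combination of the three OS metrics into one score.

    The dissertation defines its own score; this normalization is only a
    placeholder so the loop below runs end to end.
    """
    return (user_cpu / 100.0 + blocked_procs / 10.0 + ctx_switches / 10000.0) / 3.0

if __name__ == "__main__":
    controller = PIDController()
    parallelism = 8.0       # current number of concurrent map tasks
    target_score = 0.6      # assumed operating point
    for user_cpu, blocked, ctx in [(90, 2, 4000), (70, 6, 9000), (50, 9, 12000)]:
        current = score(user_cpu, blocked, ctx)
        parallelism = max(1.0, parallelism + controller.update(target_score, current))
        print(f"score={current:.2f} -> parallelism={parallelism:.1f}")
```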
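The online log compression technique in the fifth contribution replaces each log message with the identifier of the message template it was generated from. The sketch below illustrates the idea with hand-written regular-expression templates and a toy log; the template set and the encoding format are assumptions, whereas the dissertation derives templates from Hadoop/YARN's source code.

```python
import re

# Hypothetical templates; the dissertation derives them from Hadoop/YARN
# source code rather than writing them by hand.
TEMPLATES = [
    (0, re.compile(r"Starting task (\S+)")),
    (1, re.compile(r"Task (\S+) finished in (\d+) ms")),
    (2, re.compile(r"Fetching block (\S+) from (\S+)")),
]

def encode(message: str) -> str:
    """Encode a message as '<template_id>|<var1>|<var2>...' when a template matches."""
    for template_id, pattern in TEMPLATES:
        match = pattern.fullmatch(message)
        if match:
            return "|".join([str(template_id), *match.groups()])
    return "RAW|" + message  # fall back to the raw message when no template matches

if __name__ == "__main__":
    log = [
        "Starting task attempt_001",
        "Fetching block blk_42 from node7",
        "Task attempt_001 finished in 1340 ms",
    ]
    for line in log:
        print(encode(line))
```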
Keywords/Search Tags: MapReduce, Performance, Tuning, Task parallelism, Application, Log, Completion time, Large