Scheduler and I/O Based Performance Tuning Approach for Hadoop

Posted on:2013-08-07

Degree:M.S

Type:Thesis

University:University of California, Irvine

Candidate:Wan, Yiping

GTID:2458390008472144

Subject:Computer Science

Abstract/Summary:

Hadoop has emerged as a prominent open-source framework for reliable, scalable, distributed computing and data storage. It provides a flexible, highly available architecture for large-scale computation and data processing on a network of commodity hardware. The core of Hadoop is MapReduce, which is becoming increasingly popular as a programming model for large-scale parallel computing. The key benefits of MapReduce are that it automatically parallelizes computation and handles failures. Programmers do not have to deal with these complexities and can focus on the job itself, which saves a large amount of manpower and operational cost. Although Hadoop is highly available and can split a job and run it in parallel on its own, its performance leaves room for tuning. Because Hadoop's performance is closely tied to its load balancing, the task scheduler plays a critical role: it decides where to run each task and how to handle stragglers so that the job as a whole performs well. In addition, Hadoop's I/O mode is streaming, which is inefficient when the data block resides on the local node, because transferring data through an intermediary protocol adds overhead and lowers performance.

This thesis proposes a solution that enhances Hadoop's load-balancing mechanism by adding intelligent algorithms that make smarter scheduling possible. It works especially well in heterogeneous environments, where machine performance varies widely and stragglers drag down job processing. Benchmark results show that the improved scheduling mechanism can boost performance by a factor of 2 or more when there are a few stragglers in the cluster.

The thesis also proposes an enhanced I/O mode that improves on Hadoop's streaming-only I/O by using direct I/O for local-node reads. This avoids the overhead of passing data through an intermediary data transfer protocol and so improves overall Hadoop performance. Benchmarks show that with the enhanced I/O mode both Hadoop I/O performance and MapReduce performance improve; the improvement ranges from 0.1x to 1x depending on file size. The larger the file, the greater the improvement from the enhanced I/O mode.
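
The abstract does not describe the scheduling algorithm itself, so the following is only a toy illustration of why a single straggler can dominate job time and how speed-aware task placement can help. It is not the thesis's method. The function names, the task sizes, and the per-node speed values are all assumptions made up for this sketch.

```python
def round_robin_makespan(task_sizes, node_speeds):
    """Assign tasks to nodes in turn, ignoring how fast each node is."""
    finish = [0.0] * len(node_speeds)
    for i, size in enumerate(task_sizes):
        node = i % len(node_speeds)
        finish[node] += size / node_speeds[node]
    # The job is done only when the slowest node finishes.
    return max(finish)


def speed_aware_makespan(task_sizes, node_speeds):
    """Greedy placement: give each task to the node that would finish it earliest."""
    finish = [0.0] * len(node_speeds)
    # Place the largest tasks first so small ones can fill the gaps.
    for size in sorted(task_sizes, reverse=True):
        node = min(
            range(len(node_speeds)),
            key=lambda k: finish[k] + size / node_speeds[k],
        )
        finish[node] += size / node_speeds[node]
    return max(finish)


if __name__ == "__main__":
    tasks = [1.0] * 12               # hypothetical: 12 equal-sized tasks
    speeds = [1.0, 1.0, 1.0, 0.1]    # hypothetical: the last node is a straggler
    print("round robin:", round_robin_makespan(tasks, speeds))
    print("speed aware:", speed_aware_makespan(tasks, speeds))
```

In this toy setup the round-robin job time is set entirely by the slow node, while the speed-aware placement routes work away from it. A real Hadoop scheduler would also have to estimate node speed at run time and account for data locality, which this sketch ignores.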

Keywords/Search Tags:

I/O, Performance, Hadoop, Data, Computing

Related items

1	GPU Computing In Massive Data Processing
2	Performance Monitoring And Analysis On Hadoop-Based Distributed Computing Platform
3	The Analysis And Optimization Of Hadoop Data Processing Performance On Parallel Computing
4	Extension Of Hadoop Framework And Performance Tuning
5	Performance Evaluation And Optimization For BigData As A Service
6	The Study And Optimization Of Hadoop Framework On High Performance Computer
7	Real-time Performance Monitoring And I/O Performance Optimization Research On Hadoop Cluster
8	The Research Of Performance Optimization Of Hadoop In Big Data
9	The Research Of Improving Performance Of Hadoop Cluster
10	Research And Construction On Data Acquisition Model Of The Tourism Information Based On Hadoop Cloud Computing