Scheduler and I/O Based Performance Tuning Approach for Hadoop

Posted on: 2013-08-07
Degree: M.S.
Type: Thesis
University: University of California, Irvine
Candidate: Wan, Yiping
Full Text: PDF
GTID: 2458390008472144
Subject: Computer Science
Abstract/Summary:
Hadoop is emerging as a prominent open-source framework for reliable, scalable, distributed computing and data storage. It provides a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware. The core of Hadoop is MapReduce, which is becoming increasingly popular as a programming model for large-scale parallel computing. The key benefits of MapReduce are that it automatically parallelizes the computation and handles failures. Programmers do not have to think about these complexities and can focus on the actual job, which saves a large amount of manpower and operational cost. Although Hadoop is highly available and can split a job and parallelize the computation on its own, its performance still leaves room for tuning. Because Hadoop's performance is closely tied to its load balancing, the task scheduler plays a critical role: it decides where to run each task and how to handle stragglers so that the job as a whole performs well. In addition, Hadoop's I/O mode is streaming, which is inefficient when the data block is located on the local node, because the overhead of transferring data through an intermediary protocol lowers performance.

This thesis proposes a solution that enhances Hadoop's load balancing mechanism by adding intelligent algorithms that make smarter scheduling possible. The approach works especially well in heterogeneous environments, where machine performance varies widely and stragglers drag down job processing performance. Benchmark results show that the improved scheduling mechanism can boost performance by a factor of 2 or more when there are a few stragglers in the cluster.

The thesis also proposes an enhanced I/O mode that improves on Hadoop's streaming-only I/O by using direct I/O for local node reads. This avoids the overhead of transferring data through an intermediary data transfer protocol and thereby improves overall Hadoop performance. Benchmarks show that with the enhanced I/O mode, strengthened by direct I/O, both Hadoop I/O performance and MapReduce performance improve. The improvement varies from 0.1x to 1x depending on the file size: the larger the file, the greater the improvement brought by the enhanced I/O mode.
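To make the effect of stragglers on load balancing concrete, the following is a minimal sketch and not the algorithm from the thesis, whose details the abstract does not give. It assumes each node's processing speed is known in advance and compares a speed-unaware round-robin placement with a greedy earliest-finish-time placement. The function names `assign_tasks` and `round_robin_makespan`, the node names, and the speed values are all hypothetical.

```python
def assign_tasks(task_sizes, node_speeds):
    """Greedy earliest-finish-time placement (illustrative assumption only).

    task_sizes:  list of work units per task
    node_speeds: dict mapping node name -> work units processed per second
    Returns (assignment, makespan), where assignment maps each node to the
    list of task indices placed on it.
    """
    free_at = {node: 0.0 for node in node_speeds}      # time each node becomes idle
    assignment = {node: [] for node in node_speeds}
    # Place the largest tasks first so small tasks can fill the gaps later.
    order = sorted(range(len(task_sizes)), key=lambda i: task_sizes[i], reverse=True)
    for i in order:
        # Choose the node on which this task would finish earliest;
        # ties are broken by node name to keep the result deterministic.
        node = min(
            node_speeds,
            key=lambda n: (free_at[n] + task_sizes[i] / node_speeds[n], n),
        )
        free_at[node] += task_sizes[i] / node_speeds[node]
        assignment[node].append(i)
    return assignment, max(free_at.values())


def round_robin_makespan(task_sizes, node_speeds):
    """Baseline: hand out tasks in turn, ignoring node speed."""
    nodes = sorted(node_speeds)
    free_at = {n: 0.0 for n in nodes}
    for i, size in enumerate(task_sizes):
        n = nodes[i % len(nodes)]
        free_at[n] += size / node_speeds[n]
    return max(free_at.values())


if __name__ == "__main__":
    tasks = [1.0] * 12                                   # twelve equal-sized tasks
    speeds = {"fast1": 1.0, "fast2": 1.0, "slow": 0.25}  # one straggler node
    plan, makespan = assign_tasks(tasks, speeds)
    print("speed-aware makespan:", makespan)
    print("round-robin makespan:", round_robin_makespan(tasks, speeds))
    print("tasks per node:", {n: len(t) for n, t in plan.items()})
```

In this toy setting the speed-aware placement gives the straggler fewer tasks, so the job's completion time is no longer dominated by the slowest node. A real Hadoop scheduler would have to estimate node speed at run time and account for data locality, which this sketch ignores.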
Keywords/Search Tags: I/O, Performance, Hadoop, Data, Computing