
Analysis And Optimization Of Massive Data Processing On High Performance Computing Architecture

Posted on: 2012-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: H Huang
Full Text: PDF
GTID: 2218330362960442
Subject: Computer Science and Technology
Abstract/Summary:
This thesis explores how to build a Massive Data Processing paradigm on High Performance Computers (HPCs) and discusses how to run massive data processing applications on them efficiently.

First, the difficulty and significance of the massive data processing problem on HPCs are described, and the necessity and feasibility of deploying the MapReduce paradigm on HPCs to run data-intensive applications, together with the problems that may be encountered, are analyzed.

Second, the performance of the MapReduce paradigm on HPCs is evaluated experimentally under various environments, i.e., different cluster scales and different storage systems, and with different styles of applications. These evaluations show that the I/O performance of a Distributed File System (DFS) scales linearly as the number of nodes in the cluster increases, whereas the I/O performance of a centralized storage subsystem is limited by the capacity of its disk array. Hence, when the number of nodes is large, applications using a DFS achieve better performance.

Then, a Resource-Aware MapReduce (RA-MapReduce) performance prediction model is built. By analyzing the detailed execution process in each phase of a MapReduce job, the execution performance of a MapReduce application, mainly reflected by the total time cost of its jobs, is correlated with application characteristic parameters and cluster hardware characteristic parameters. With this model, both the time cost of each stage under different architectures and hardware resources when processing different types of data-intensive applications, and the ratio of computation cost to data I/O cost in each stage, can be calculated. Using the RA-MapReduce performance prediction model, we can predict the best performance achievable in a specific hardware environment when running a specific MapReduce application, identify the bottlenecks that limit MapReduce performance, and determine how performance varies after augmenting a specific kind of hardware resource.

Next, an optimization solution for the MapReduce paradigm on HPCs is presented, motivated by the limited data I/O capacity of HPCs, which may not meet the requirements of data-intensive applications. The solution consists of two parts: Intermediate Results Network Transfer Optimization and Intermediate Results Localized Storage Optimization.

Finally, the optimizations are verified by combining the RA-MapReduce performance prediction model with experimental data. The effectiveness of the Intermediate Results Network Transfer Optimization and the Intermediate Results Localized Storage Optimization is demonstrated by both model analysis and experiments. In our experiments, the I/O of the storage subsystem is the bottleneck of overall system performance, so the Intermediate Results Localized Storage Optimization improves system performance by relieving the load on the storage subsystem. With this optimization, the performance of the TeraSort benchmark on a centralized storage subsystem improves by 32.5%.
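The abstract does not reproduce the model's equations. As a minimal, non-authoritative sketch of the phase-decomposition idea it describes, assuming hypothetical symbols C_p (computation volume of phase p), D_p (data volume moved in phase p), R_cpu (aggregate compute rate), and B_p (the storage or network bandwidth serving the phase), the total job time could take the form:

T_{\mathrm{job}} \approx \sum_{p \in \{\mathrm{map},\,\mathrm{shuffle},\,\mathrm{reduce}\}} T_p,
\qquad
T_p \approx \max\!\left(\frac{C_p}{R_{\mathrm{cpu}}},\ \frac{D_p}{B_p}\right)

The per-phase ratio (C_p / R_cpu) / (D_p / B_p) then makes explicit whether a stage is computation-bound or I/O-bound, matching the model's stated ability to report the ratio of computation cost to data I/O cost in each stage.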
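To make the prediction workflow concrete, the following Python sketch evaluates such a hypothetical model and reports each phase's bottleneck resource. The phase names, parameters, and max(compute, I/O) form are illustrative assumptions, not the thesis's actual formulation.

from dataclasses import dataclass

@dataclass
class Cluster:
    cpu_rate: float   # useful compute throughput, MB of data processed per second
    disk_bw: float    # aggregate storage bandwidth, MB/s
    net_bw: float     # aggregate network bandwidth, MB/s

@dataclass
class Job:
    input_mb: float       # input data size
    shuffle_ratio: float  # intermediate data size / input size
    output_ratio: float   # output data size / input size

def phase_time(data_mb: float, compute_rate: float, io_bw: float) -> tuple[float, str]:
    """One phase's time = max of compute time and I/O time (assumes full overlap)."""
    t_cpu, t_io = data_mb / compute_rate, data_mb / io_bw
    return (t_cpu, "CPU") if t_cpu >= t_io else (t_io, "I/O")

def predict(job: Job, c: Cluster) -> None:
    # Decompose the job into map, shuffle, and reduce phases, each limited
    # by whichever of compute or data movement saturates first.
    phases = {
        "map":     phase_time(job.input_mb, c.cpu_rate, c.disk_bw),
        "shuffle": phase_time(job.input_mb * job.shuffle_ratio, c.cpu_rate, c.net_bw),
        "reduce":  phase_time(job.input_mb * job.output_ratio, c.cpu_rate, c.disk_bw),
    }
    for name, (t, limiter) in phases.items():
        print(f"{name:7s}: {t:8.1f} s  ({limiter}-bound)")
    print(f"total  : {sum(t for t, _ in phases.values()):8.1f} s")

# Example: a 1 TB sort-like job on a cluster whose centralized storage is
# slower than its network, so storage I/O dominates.
predict(Job(input_mb=1_000_000, shuffle_ratio=1.0, output_ratio=1.0),
        Cluster(cpu_rate=2_000.0, disk_bw=800.0, net_bw=1_500.0))

In this toy configuration the aggregate storage bandwidth is the scarce resource, so the map and reduce phases come out I/O-bound, which is exactly the situation the Intermediate Results Localized Storage Optimization targets.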
Keywords/Search Tags: High Performance Computer, massive data processing, MapReduce paradigm