
Analysis And Optimization Of Massive Data Processing On High Performance Computing Architecture

Posted on: 2012-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: H Huang
Full Text: PDF
GTID: 2218330362960442
Subject: Computer Science and Technology
Abstract/Summary:
This thesis explores how to build a Massive Data Processing paradigm on High Performance Computers (HPCs) and discusses how to run massive data processing applications on them efficiently.

First, the difficulty and significance of the massive data processing problem on HPCs are described, and the necessity and feasibility of deploying the MapReduce paradigm on HPCs to run data-intensive applications, together with the problems that may be encountered, are analyzed.

Second, the performance of the MapReduce paradigm on HPCs is evaluated experimentally under various environments, i.e., different cluster scales and different storage systems, and with different styles of applications. These evaluations show that the I/O performance of a Distributed File System (DFS) scales linearly as the number of nodes in the cluster increases, whereas the I/O performance of a centralized storage subsystem is limited by the capacity of its disk array. Hence, when the number of nodes is large, applications using a DFS achieve better performance.

Then, a Resource-Aware MapReduce (RA-MapReduce) performance prediction model is built. By analyzing the detailed execution process in each phase of a MapReduce job, the execution performance of a MapReduce application, mainly reflected by the total time cost of its jobs, is correlated with application characteristic parameters and cluster hardware characteristic parameters. With this model, both the time cost of each stage under different architectures and hardware resources when processing different types of data-intensive applications, and the ratio of computation cost to data I/O cost in each stage, can be calculated. Using the RA-MapReduce performance prediction model, we can predict the best performance achievable in a specific hardware environment when running a specific MapReduce application, identify the bottlenecks that limit MapReduce performance, and determine how performance varies after augmenting a specific kind of hardware resource.

Next, an optimization solution for the MapReduce paradigm on HPCs is presented, motivated by the limited data I/O capacity of HPCs, which may not meet the requirements of data-intensive applications. The solution consists of two parts: Intermediate Results Network Transfer Optimization and Intermediate Results Localized Storage Optimization.

Finally, the optimizations are verified by combining the RA-MapReduce performance prediction model with experimental data. The effectiveness of the Intermediate Results Network Transfer Optimization and the Intermediate Results Localized Storage Optimization is demonstrated by both model analysis and experiments. In our experiments, the I/O of the storage subsystem is the bottleneck of overall system performance, so the Intermediate Results Localized Storage Optimization improves system performance by relieving the load on the storage subsystem. With this optimization, the performance of the TeraSort benchmark on a centralized storage subsystem improves by 32.5%.
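The abstract does not reproduce the model's equations. As a minimal, non-authoritative sketch of the phase-decomposition idea it describes, assuming hypothetical symbols C_p (computation volume of phase p), D_p (data volume moved in phase p), R_cpu (aggregate compute rate), and B_p (the storage or network bandwidth serving the phase), the total job time could take the form:

T_{\mathrm{job}} \approx \sum_{p \in \{\mathrm{map},\,\mathrm{shuffle},\,\mathrm{reduce}\}} T_p,
\qquad
T_p \approx \max\!\left(\frac{C_p}{R_{\mathrm{cpu}}},\ \frac{D_p}{B_p}\right)

The per-phase ratio (C_p / R_cpu) / (D_p / B_p) then makes explicit whether a stage is computation-bound or I/O-bound, matching the model's stated ability to report the ratio of computation cost to data I/O cost in each stage.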
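To make the prediction workflow concrete, the following Python sketch evaluates such a hypothetical model and reports each phase's bottleneck resource. The phase names, parameters, and max(compute, I/O) form are illustrative assumptions, not the thesis's actual formulation.

from dataclasses import dataclass

@dataclass
class Cluster:
    cpu_rate: float   # useful compute throughput, MB of data processed per second
    disk_bw: float    # aggregate storage bandwidth, MB/s
    net_bw: float     # aggregate network bandwidth, MB/s

@dataclass
class Job:
    input_mb: float       # input data size
    shuffle_ratio: float  # intermediate data size / input size
    output_ratio: float   # output data size / input size

def phase_time(data_mb: float, compute_rate: float, io_bw: float) -> tuple[float, str]:
    """One phase's time = max of compute time and I/O time (assumes full overlap)."""
    t_cpu, t_io = data_mb / compute_rate, data_mb / io_bw
    return (t_cpu, "CPU") if t_cpu >= t_io else (t_io, "I/O")

def predict(job: Job, c: Cluster) -> None:
    # Decompose the job into map, shuffle, and reduce phases, each limited
    # by whichever of compute or data movement saturates first.
    phases = {
        "map":     phase_time(job.input_mb, c.cpu_rate, c.disk_bw),
        "shuffle": phase_time(job.input_mb * job.shuffle_ratio, c.cpu_rate, c.net_bw),
        "reduce":  phase_time(job.input_mb * job.output_ratio, c.cpu_rate, c.disk_bw),
    }
    for name, (t, limiter) in phases.items():
        print(f"{name:7s}: {t:8.1f} s  ({limiter}-bound)")
    print(f"total  : {sum(t for t, _ in phases.values()):8.1f} s")

# Example: a 1 TB sort-like job on a cluster whose centralized storage is
# slower than its network, so storage I/O dominates.
predict(Job(input_mb=1_000_000, shuffle_ratio=1.0, output_ratio=1.0),
        Cluster(cpu_rate=2_000.0, disk_bw=800.0, net_bw=1_500.0))

In this toy configuration the aggregate storage bandwidth is the scarce resource, so the map and reduce phases come out I/O-bound, which is exactly the situation the Intermediate Results Localized Storage Optimization targets.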
Keywords/Search Tags: High Performance Computer, massive data processing, MapReduce paradigm