
Study On Deploying And Optimizing MapReduce Framework On High Performance Computer

Posted on: 2014-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: J Yu
Full Text: PDF
GTID: 2308330479979487
Subject: Computer Science and Technology

Abstract/Summary:
With the advent of the big data era, the volume of data in areas such as scientific research and industrial applications is growing exponentially, while the demand for complex data analysis is increasingly intense. High performance computing occupies a pivotal position in the scientific and technological service industry, and as its applications continue to expand, the high performance computer has become an important platform for large-scale data processing.

However, high performance computers have some inherent drawbacks when handling data-intensive applications. Most high performance computers adopt centralized storage systems (such as the Lustre file system). This storage-centric architecture simplifies programming, but it often causes I/O bottlenecks when processing large-scale data, which restricts the overall performance of the system. Moreover, as high performance computer systems grow in complexity and scale, the mean time between failures is becoming shorter, lowering system availability and affecting the quality of service of supercomputing centers.

In this paper, we propose deploying the MapReduce framework on high performance computers to address these problems. As a large-scale data processing framework, MapReduce drew immediate attention from industry and academia upon publication, and quickly became a de facto standard for big data processing. The MapReduce framework treats fault tolerance as a primary design consideration, and addresses availability and scalability at the system level.
Its "Move Computation to Data" strategy eases the I/O pressure of data movement, so large-scale data analysis and processing can be performed with high efficiency.

This paper focuses on the following studies:
(1) Study the differences between high performance computers with centralized storage systems and commodity server clusters, demonstrate the significance of deploying the MapReduce framework on high performance computers, and explore deployment methods.
(2) Analyze the characteristics of data flow after deploying the MapReduce framework on centralized storage, in order to increase the efficiency of remote data access and avoid data duplication.
(3) Analyze the characteristics of the centralized storage system and the MapReduce framework to tune Lustre for better performance.
(4) Utilize a virtual memory disk (RAM disk) to store temporary and intermediate data, further alleviating I/O pressure on Lustre.
(5) Verify the effect of the optimization strategies on the high performance computer TH-1A, showing that a MapReduce framework on HPC could be applied in actual production to run MapReduce applications.
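As a rough illustration of what such a deployment can look like in practice (the thesis does not publish its exact settings, so this is only a sketch: the property names are standard Hadoop 2.x configuration keys, while the mount points /lustre and /dev/shm/hadoop are hypothetical), running MapReduce directly on a shared Lustre mount typically means bypassing HDFS and redirecting shuffle/intermediate data to a tmpfs-backed RAM disk:

```xml
<!-- core-site.xml: use the POSIX filesystem API instead of HDFS,
     since Lustre is already mounted on every compute node -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
  <property>
    <!-- hypothetical shared Lustre path visible to all nodes -->
    <name>hadoop.tmp.dir</name>
    <value>/lustre/hadoop/tmp</value>
  </property>
</configuration>

<!-- mapred-site.xml: keep map outputs and other intermediate data
     off Lustre to ease I/O pressure, as in contribution (4) -->
<configuration>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/dev/shm/hadoop/local</value>
  </property>
</configuration>
```

Placing `mapreduce.cluster.local.dir` on /dev/shm trades node memory for I/O: map outputs never touch the shared file system, which is exactly the pressure point the abstract identifies, but the RAM disk must be sized against the job's peak intermediate data volume.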
Keywords/Search Tags:high performance computing, MapReduce framework, Hadoop, memory optimization