
Study On Deploying And Optimizing MapReduce Framework On High Performance Computer

Posted on: 2014-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: J Yu
Full Text: PDF
GTID: 2308330479979487
Subject: Computer Science and Technology

Abstract/Summary:
With the advent of the big data era, the volume of data in areas such as scientific research and industrial applications is growing exponentially, while the demand for complex data analysis is increasingly intense. High performance computing occupies a pivotal position in the scientific and technological service industry, and as its applications continue to expand, the high performance computer has become an important platform for large-scale data processing.

However, high performance computers have some inherent drawbacks when handling data-intensive applications. Most high performance computers adopt centralized storage systems (such as the Lustre file system). This storage-centric architecture simplifies programming, but it often causes I/O bottlenecks when processing large-scale data, which restricts the overall performance of the system. Moreover, as high performance computer systems grow in complexity and scale, the mean time between failures is becoming shorter, lowering system availability and affecting the quality of service of supercomputing centers.

In this paper, we propose deploying the MapReduce framework on high performance computers to address these problems. As a large-scale data processing framework, MapReduce drew immediate attention from industry and academia upon publication, and quickly became a de facto standard for big data processing. The MapReduce framework treats fault tolerance as a primary design consideration, and addresses availability and scalability at the system level.
Its "Move Computation to Data" strategy eases the I/O pressure of data movement, so large-scale data analysis and processing can be performed with high efficiency.

This paper focuses on the following studies:
(1) Study the differences between high performance computers with centralized storage systems and commodity server clusters, demonstrate the significance of deploying the MapReduce framework on high performance computers, and explore deployment methods.
(2) Analyze the characteristics of data flow after deploying the MapReduce framework on centralized storage, in order to increase the efficiency of remote data access and avoid data duplication.
(3) Analyze the characteristics of the centralized storage system and the MapReduce framework to tune Lustre for better performance.
(4) Utilize a virtual memory disk (RAM disk) to store temporary and intermediate data, further alleviating I/O pressure on Lustre.
(5) Verify the effect of the optimization strategies on the high performance computer TH-1A, showing that a MapReduce framework on HPC could be applied in actual production to run MapReduce applications.
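As a rough illustration of what such a deployment can look like in practice (the thesis does not publish its exact settings, so this is only a sketch: the property names are standard Hadoop 2.x configuration keys, while the mount points /lustre and /dev/shm/hadoop are hypothetical), running MapReduce directly on a shared Lustre mount typically means bypassing HDFS and redirecting shuffle/intermediate data to a tmpfs-backed RAM disk:

```xml
<!-- core-site.xml: use the POSIX filesystem API instead of HDFS,
     since Lustre is already mounted on every compute node -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
  <property>
    <!-- hypothetical shared Lustre path visible to all nodes -->
    <name>hadoop.tmp.dir</name>
    <value>/lustre/hadoop/tmp</value>
  </property>
</configuration>

<!-- mapred-site.xml: keep map outputs and other intermediate data
     off Lustre to ease I/O pressure, as in contribution (4) -->
<configuration>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/dev/shm/hadoop/local</value>
  </property>
</configuration>
```

Placing `mapreduce.cluster.local.dir` on /dev/shm trades node memory for I/O: map outputs never touch the shared file system, which is exactly the pressure point the abstract identifies, but the RAM disk must be sized against the job's peak intermediate data volume.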
Keywords/Search Tags:high performance computing, MapReduce framework, Hadoop, memory optimization