The Performance Optimization And Improvement Of MapReduce In Hadoop

Posted on:2012-11-12

Degree:Master

Type:Thesis

Country:China

Candidate:R B He

GTID:2218330368958670

Subject:Computer application technology

Abstract/Summary:

The Internet has entered an era of explosive data growth. People's work, daily life, and entertainment are now closely tied to the network, so the volume of data on the Internet is increasing dramatically and the range of applications keeps widening. This seemingly chaotic data holds enormous business opportunities, and an enterprise's future success depends largely on whether it can extract value from that data. The resulting problem is that the processing capacity of a single computer can no longer meet the requirements of today's massive-data applications. Distributed computing on large-scale computer clusters has become the main route to improving data processing performance.

Owing to its reliability, efficient distributed parallel processing, easy scalability, and open-source nature, Hadoop has become the mainstream open-source cloud computing platform in just three years. However, Hadoop is still relatively young and leaves much room for improvement. This thesis thoroughly analyzes one of Hadoop's core technologies, the MapReduce computation model. It optimizes and improves the framework to address flaws in how the temporary data output by the Map phase is managed and controlled. The aim is to remove the performance bottleneck caused by large volumes of intermediate data and by unbalanced data distribution while a program is running, and thereby to improve program performance and make better use of resources.

The main research contents and contributions are as follows. The state of cloud computing development in China and abroad, its application prospects, and its existing problems are discussed. Distributed systems such as Hadoop distributed computing, grid computing, and volunteer computing are distinguished from one another. The background and architecture of the Hadoop platform are introduced. The operating mechanisms of Hadoop's two core technologies, HDFS and MapReduce, are studied. After the data read-write process and the intermediate data control of MapReduce are analyzed, an optimization idea and an improvement plan are proposed, then tested and verified with a concrete case. The experimental results suggest that the expected objectives were achieved and that the shortcomings of the existing framework were resolved.
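
The two problems named above, large volumes of intermediate data and unbalanced data distribution, can be illustrated with a small Python sketch. It is not the thesis's implementation and does not use the Hadoop API. The function names (map_phase, combine, partition), the word-count job, the sample input, and the CRC32-based partitioning are all assumptions made for illustration. The sketch treats the input as the output of a single map task, shows how map-side pre-aggregation shrinks the number of records to be shuffled, and shows how a frequent key concentrates records on one reducer.

```python
from collections import Counter, defaultdict
import zlib


def map_phase(lines):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Pre-aggregate pairs on the map side so fewer records are shuffled."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(pairs, num_reducers):
    """Assign each pair to a reducer by a deterministic hash of its key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[reducer].append((key, value))
    return buckets


# Hypothetical skewed input: the key "a" dominates.
lines = ["a a a a b", "a a c a a", "a b a a a"]

raw = map_phase(lines)    # intermediate data without pre-aggregation
combined = combine(raw)   # intermediate data after map-side combining

# Records each reducer would receive if the raw pairs were shuffled as-is.
raw_load = {r: len(p) for r, p in partition(raw, 2).items()}

print("raw records:", len(raw))
print("combined records:", len(combined))
print("raw records per reducer:", raw_load)
```

In this toy input the 15 raw map records collapse to 3 after combining, and without combining the reducer that receives key "a" handles at least 12 of the 15 records.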

Keywords/Search Tags:

Distributed computing, Hadoop, HDFS, MapReduce

Related items

1	Optimization And Application Research Of MapReduce Computing Model Based On Hadoop
2	Research And Implementation Of HDFS Distributed File System
3	Working Principle And Applied Research Of MapReduce
4	MapReduce Performance Research And Optimization Based On Block Aggregation
5	The Design Of The Cloud Computing System Based On Hadoop
6	The Cloud Computing Based On Hadoop Platform And Log Analysis
7	Analysis And Application Development Of Hadoop Distributed Computing Platform
8	Research On The Application Of Cloud Computing Based On Hadoop
9	Research And Implementation Of Distributed Web Crawl Based On Hadoop Architecture
10	Processing Of Small Files Based On HDFS And Optimization And Improvement Of The Performance For Mapreduce Computing Model