The Performance Optimization And Improvement Of MapReduce In Hadoop

Posted on: 2012-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: R B He
Full Text: PDF
GTID: 2218330368958670
Subject: Computer application technology
Abstract/Summary:
Today the Internet is in an era of data explosion. People's work, life, and entertainment are tightly bound to the network, which has dramatically increased the scale of data on the Internet and diversified the types of applications. This seemingly chaotic data in fact holds enormous business opportunities, and an enterprise's future success largely depends on whether it can extract value from that data. The resulting problem is that the processing capacity of a single computer cannot meet the requirements of current massive-data applications. Distributed computing on large-scale computer clusters has become the main route to improving data processing performance.

Thanks to its reliability and stability, efficient distributed parallel processing, easy scalability, and open-source nature, Hadoop has become the mainstream open-source cloud computing platform in just three years. However, Hadoop's development history is relatively short, and there is still much room for improvement. This thesis thoroughly analyzes one of Hadoop's core technologies, the MapReduce computation model. It proposes optimizations and improvements that address flaws in how the temporary data output by the Map phase is managed and controlled. The aim is to remove the performance bottleneck caused by large volumes of intermediate data and by unbalanced data distribution at run time, and thereby to improve program performance and resource utilization.

The main research contents and contributions are as follows. The state of cloud computing development at home and abroad, its application prospects, and its existing problems are discussed. Distributed systems such as Hadoop distributed computing, grid computing, and volunteer computing are distinguished from one another. The background and architecture of the Hadoop platform are introduced, and the operating mechanisms of Hadoop's two core technologies, HDFS and MapReduce, are studied. After an analysis of how MapReduce reads and writes data and controls intermediate data, an optimization idea and an improvement plan are proposed, then tested and verified on a specific case. The experimental results suggest that the expected objectives were achieved and that the identified shortcomings of the existing framework were addressed.
Keywords/Search Tags: Distributed computing, Hadoop, HDFS, MapReduce