
Research On The Application Of Cloud Computing Based On Hadoop

Posted on: 2015-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: J L Li
GTID: 2308330473453114
Subject: Information security

Abstract/Summary:
With the rapid development of e-commerce, social networking, and other Internet applications in recent years, cloud computing has attracted wide attention as an important innovation in the information industry from the moment it was proposed. Many industry giants have introduced cloud computing products. Among them, the Hadoop platform, as an open-source implementation of Google's MapReduce and GFS, has won broad recognition in the information industry. Its key components, MapReduce and HDFS, provide reliable distributed computing and data storage. However, as the industry develops, both face declining efficiency in certain applications, which could affect the long-term development of the Hadoop platform. This thesis takes HDFS and MapReduce as its main research objects. The main content is as follows:

First, the thesis presents cloud computing and Hadoop in detail: the concept, background, features, and deployment models of cloud computing, together with an analysis of its framework, key technologies, and architecture. It then introduces the background, key components, and framework of the Hadoop platform, focusing on HDFS and MapReduce; this includes HDFS's architecture, file read and write operations, and how data integrity is ensured. For MapReduce, the focus is on the basic principles of the programming model, the computation process, and the implementation framework, followed by a discussion and study of MRv1 and MRv2.

Building on this introduction to MapReduce, the thesis analyzes its performance bottleneck: Mappers generate a large volume of intermediate results, but Reducers are not invoked to merge these results at the same time. This increases the network burden of transferring intermediate results, leaves Reducers idle, and reduces the overall efficiency of MapReduce. To solve this problem, we propose an optimization based on EMR.
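The intermediate-result pressure described above can be illustrated with a toy simulation. This is a sketch only, not the Hadoop API: each "mapper" emits (word, 1) pairs, and a combiner-style local merge shows how pre-aggregating on the mapper side cuts the number of pairs that must cross the network before the Reducers run. All names and data here are illustrative.

```python
from collections import defaultdict

# Toy word-count job: three input splits, one "mapper" per split.
splits = [
    "cloud computing cloud",
    "hadoop mapreduce hadoop hadoop",
    "cloud hadoop",
]

def map_split(text):
    # Emit one intermediate (key, value) pair per word occurrence.
    return [(word, 1) for word in text.split()]

def combine(pairs):
    # Combiner-style local merge on the mapper side: pre-aggregate
    # pairs before they are shipped across the "network".
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

raw = [map_split(s) for s in splits]
combined = [combine(p) for p in raw]

shipped_raw = sum(len(p) for p in raw)            # pairs sent unmerged
shipped_combined = sum(len(p) for p in combined)  # pairs sent after local merge

def reduce_all(per_mapper_pairs):
    # Reduce phase: merge all intermediate pairs into final counts.
    counts = defaultdict(int)
    for pairs in per_mapper_pairs:
        for key, value in pairs:
            counts[key] += value
    return dict(counts)

assert reduce_all(raw) == reduce_all(combined)  # same final answer either way
print(shipped_raw, shipped_combined)  # 9 6
```

Local merging does not change the final result, only how much intermediate data is transferred; the thesis's EMR-based optimization targets this same transfer-and-idle-Reducer cost from the scheduling side.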
In its implementation framework, we use MPI to allow Reducers to run in parallel with Mappers, processing intermediate results as they are produced, and we describe this MPI-based implementation. We then analyze HDFS's performance bottleneck in handling large numbers of small files and propose an optimization: using HAR to consolidate the metadata of small files under a single directory, reducing the amount of directory metadata and thereby the NameNode's memory usage. We also research and study the NameNode's metadata.

To verify the performance improvement of these optimizations, we conduct experiments on the Hadoop platform. In the first experiment, running time is the evaluation criterion, and the results show that the EMR-based optimization is more efficient. In the second experiment, on the same platform, the results show that our optimization reduces the amount of small-file metadata when handling large numbers of small files, improving the efficiency of the NameNode.
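The Mapper/Reducer overlap can be sketched as a producer-consumer pipeline. The thesis implements this with MPI message passing; the sketch below is an assumption-laden stand-in that uses Python threads and a queue as the message channel, so the reducer merges each intermediate pair as soon as it arrives instead of waiting for all Mappers to finish.

```python
import queue
import threading
from collections import defaultdict

# The queue plays the role of the MPI channel between Mappers and the
# Reducer; this only illustrates the overlap, not the real MPI code.
SENTINEL = None
channel = queue.Queue()

def mapper(texts):
    for text in texts:
        for word in text.split():
            channel.put((word, 1))  # send each intermediate pair immediately
    channel.put(SENTINEL)           # signal the end of this mapper's stream

def reducer(num_mappers, result):
    finished = 0
    while finished < num_mappers:
        item = channel.get()
        if item is SENTINEL:
            finished += 1
        else:
            key, value = item
            result[key] += value    # merge as soon as a pair arrives

result = defaultdict(int)
mappers = [
    threading.Thread(target=mapper, args=(["cloud hadoop", "cloud"],)),
    threading.Thread(target=mapper, args=(["hadoop hadoop"],)),
]
consumer = threading.Thread(target=reducer, args=(len(mappers), result))
consumer.start()
for m in mappers:
    m.start()
for m in mappers:
    m.join()
consumer.join()
print(sorted(result.items()))  # [('cloud', 2), ('hadoop', 3)]
```

Because reducing overlaps with mapping, the Reducer is no longer idle during the map phase, which is the efficiency gain the MPI-based framework aims for.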
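The small-files effect on NameNode memory can be shown with a back-of-envelope model. The ~150 bytes per namespace object is a commonly cited HDFS rule of thumb, used here only as an illustrative constant; the HAR entry counts are likewise assumptions, not measurements from the thesis.

```python
# Rough NameNode heap cost per file or block object (rule of thumb).
BYTES_PER_OBJECT = 150

def namenode_bytes(num_files, blocks_per_file=1):
    # Each file contributes one file object plus its block objects,
    # all held in NameNode memory.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

small_files = 1_000_000
before = namenode_bytes(small_files)

# After packing into a HAR, the NameNode sees only a handful of files
# (part files plus index files); the per-file lookup data moves into the
# archive's index, which is stored on DataNodes, not in NameNode heap.
har_entries = 4          # illustrative: part file, indexes, directory
after = namenode_bytes(har_entries, blocks_per_file=8)

print(before, after)  # 300000000 5400
```

Even with generous block counts for the archive files, consolidating a million small files into one archive removes almost all of their namespace footprint, which is the mechanism behind the thesis's HAR-based optimization.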
Keywords/Search Tags:Cloud Computing, Hadoop, MapReduce, HDFS