
Research On The Application Of Cloud Computing Based On Hadoop

Posted on: 2015-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: J L Li
GTID: 2308330473453114
Subject: Information security

Abstract/Summary:
With the rapid development of e-commerce, social networking, and other Internet applications in recent years, cloud computing has attracted wide attention as an important innovation in the information industry from the moment it was proposed. Many industry giants have introduced cloud computing products. Among them, the Hadoop platform, as an open-source implementation of Google's MapReduce and GFS, has won broad recognition in the information industry. Its key components, MapReduce and HDFS, provide reliable distributed computing and data storage. However, as the industry develops, both face declining efficiency in certain applications, which could affect the long-term development of the Hadoop platform. This thesis takes HDFS and MapReduce as its main research objects. The main content is as follows:

First, the thesis presents cloud computing and Hadoop in detail: the concept, background, features, and deployment models of cloud computing, together with an analysis of its framework, key technologies, and architecture. It then introduces the background, key components, and framework of the Hadoop platform, focusing on HDFS and MapReduce; this includes HDFS's architecture, file read and write operations, and how data integrity is ensured. For MapReduce, the focus is on the basic principles of the programming model, the computation process, and the implementation framework, followed by a discussion and study of MRv1 and MRv2.

Building on this introduction to MapReduce, the thesis analyzes its performance bottleneck: Mappers generate a large volume of intermediate results, but Reducers are not invoked to merge these results at the same time. This increases the network burden of transferring intermediate results, leaves Reducers idle, and reduces the overall efficiency of MapReduce. To solve this problem, we propose an optimization based on EMR.
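The intermediate-result pressure described above can be illustrated with a toy simulation. This is a sketch only, not the Hadoop API: each "mapper" emits (word, 1) pairs, and a combiner-style local merge shows how pre-aggregating on the mapper side cuts the number of pairs that must cross the network before the Reducers run. All names and data here are illustrative.

```python
from collections import defaultdict

# Toy word-count job: three input splits, one "mapper" per split.
splits = [
    "cloud computing cloud",
    "hadoop mapreduce hadoop hadoop",
    "cloud hadoop",
]

def map_split(text):
    # Emit one intermediate (key, value) pair per word occurrence.
    return [(word, 1) for word in text.split()]

def combine(pairs):
    # Combiner-style local merge on the mapper side: pre-aggregate
    # pairs before they are shipped across the "network".
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

raw = [map_split(s) for s in splits]
combined = [combine(p) for p in raw]

shipped_raw = sum(len(p) for p in raw)            # pairs sent unmerged
shipped_combined = sum(len(p) for p in combined)  # pairs sent after local merge

def reduce_all(per_mapper_pairs):
    # Reduce phase: merge all intermediate pairs into final counts.
    counts = defaultdict(int)
    for pairs in per_mapper_pairs:
        for key, value in pairs:
            counts[key] += value
    return dict(counts)

assert reduce_all(raw) == reduce_all(combined)  # same final answer either way
print(shipped_raw, shipped_combined)  # 9 6
```

Local merging does not change the final result, only how much intermediate data is transferred; the thesis's EMR-based optimization targets this same transfer-and-idle-Reducer cost from the scheduling side.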
In its implementation framework, we use MPI to allow Reducers to run in parallel with Mappers, processing intermediate results as they are produced, and we describe this MPI-based implementation. We then analyze HDFS's performance bottleneck in handling large numbers of small files and propose an optimization: using HAR to consolidate the metadata of small files under a single directory, reducing the amount of directory metadata and thereby the NameNode's memory usage. We also research and study the NameNode's metadata.

To verify the performance improvement of these optimizations, we conduct experiments on the Hadoop platform. In the first experiment, running time is the evaluation criterion, and the results show that the EMR-based optimization is more efficient. In the second experiment, on the same platform, the results show that our optimization reduces the amount of small-file metadata when handling large numbers of small files, improving the efficiency of the NameNode.
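The Mapper/Reducer overlap can be sketched as a producer-consumer pipeline. The thesis implements this with MPI message passing; the sketch below is an assumption-laden stand-in that uses Python threads and a queue as the message channel, so the reducer merges each intermediate pair as soon as it arrives instead of waiting for all Mappers to finish.

```python
import queue
import threading
from collections import defaultdict

# The queue plays the role of the MPI channel between Mappers and the
# Reducer; this only illustrates the overlap, not the real MPI code.
SENTINEL = None
channel = queue.Queue()

def mapper(texts):
    for text in texts:
        for word in text.split():
            channel.put((word, 1))  # send each intermediate pair immediately
    channel.put(SENTINEL)           # signal the end of this mapper's stream

def reducer(num_mappers, result):
    finished = 0
    while finished < num_mappers:
        item = channel.get()
        if item is SENTINEL:
            finished += 1
        else:
            key, value = item
            result[key] += value    # merge as soon as a pair arrives

result = defaultdict(int)
mappers = [
    threading.Thread(target=mapper, args=(["cloud hadoop", "cloud"],)),
    threading.Thread(target=mapper, args=(["hadoop hadoop"],)),
]
consumer = threading.Thread(target=reducer, args=(len(mappers), result))
consumer.start()
for m in mappers:
    m.start()
for m in mappers:
    m.join()
consumer.join()
print(sorted(result.items()))  # [('cloud', 2), ('hadoop', 3)]
```

Because reducing overlaps with mapping, the Reducer is no longer idle during the map phase, which is the efficiency gain the MPI-based framework aims for.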
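The small-files effect on NameNode memory can be shown with a back-of-envelope model. The ~150 bytes per namespace object is a commonly cited HDFS rule of thumb, used here only as an illustrative constant; the HAR entry counts are likewise assumptions, not measurements from the thesis.

```python
# Rough NameNode heap cost per file or block object (rule of thumb).
BYTES_PER_OBJECT = 150

def namenode_bytes(num_files, blocks_per_file=1):
    # Each file contributes one file object plus its block objects,
    # all held in NameNode memory.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

small_files = 1_000_000
before = namenode_bytes(small_files)

# After packing into a HAR, the NameNode sees only a handful of files
# (part files plus index files); the per-file lookup data moves into the
# archive's index, which is stored on DataNodes, not in NameNode heap.
har_entries = 4          # illustrative: part file, indexes, directory
after = namenode_bytes(har_entries, blocks_per_file=8)

print(before, after)  # 300000000 5400
```

Even with generous block counts for the archive files, consolidating a million small files into one archive removes almost all of their namespace footprint, which is the mechanism behind the thesis's HAR-based optimization.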
Keywords/Search Tags:Cloud Computing, Hadoop, MapReduce, HDFS