
Research And Optimization Of Hadoop Small File Processing Technology

Posted on: 2017-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: F Zhao
Full Text: PDF
GTID: 2308330485969653
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, digital information is growing exponentially and mankind has entered the era of big data. Traditional approaches to data storage and computation are increasingly inadequate, and how to store and process massive data efficiently has become a focus of both academia and industry. The concept of cloud computing emerged to meet these demanding storage and processing requirements, and its rapid development has made storage and computation one of the hottest research areas.

Hadoop is a top-level project of the Apache Foundation. Its superior performance in distributed storage and computing has attracted attention at home and abroad, and a growing number of universities and enterprises have adopted Hadoop to support their business needs. Hadoop has two core components: HDFS for distributed storage and the MapReduce distributed computing model. Hadoop was designed from the outset for large data, but real production workloads contain huge numbers of small files. When Hadoop stores small files, it places heavy memory pressure on the master node, degrades file access efficiency, and also hurts the efficiency of the MapReduce computing model.

To address the memory waste and inefficient access and computation that small files cause in Hadoop, this thesis first surveys Hadoop's existing small file processing techniques and analyzes their advantages and disadvantages, and then carries out research and optimization at both the MapReduce and HDFS levels to improve Hadoop's storage and computation efficiency for small files. At the MapReduce level, the execution process of MapReduce and the structure of InputFormat are studied in depth, with a detailed analysis of the MapReduce source code and the internal implementation of its methods. By studying and implementing a subclass of the CombineFileInputFormat abstract class, small files are merged at the input format stage, improving the efficiency of MapReduce jobs over small files. At the HDFS level, this thesis proposes an independent small file processing module for the distributed file system. The module does not depend on HDFS internals, and the whole module can be decoupled from the Hadoop cluster. It merges small files, maintains an index mapping for reads, and adds a cache module, which improves file access efficiency and indirectly improves the efficiency of MapReduce when computing over small files.

Finally, experiments verify that the custom CombineFileInputFormat processes small files in MapReduce more efficiently than the other input formats, and that the independent small file processing module accelerates file access while reducing the memory pressure on the master node.
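The abstract does not reproduce the thesis's input format implementation, so the following is a minimal Java sketch of the standard pattern for subclassing CombineFileInputFormat in the Hadoop 2.x mapreduce API. The class name SmallFileCombineInputFormat and the 64 MB split cap are illustrative assumptions, not details taken from the thesis.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Packs many small text files into a few combined splits, so one map task
// processes many files instead of each small file getting its own task.
public class SmallFileCombineInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

    public SmallFileCombineInputFormat() {
        // Cap the size of each combined split; 64 MB is an assumed tuning value.
        setMaxSplitSize(64L * 1024 * 1024);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Each small file is read whole by a single record reader.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader iterates over the files inside the combined
        // split, instantiating the wrapper below for each file via reflection.
        return new CombineFileRecordReader<>(
                (CombineFileSplit) split, context, TextWrapper.class);
    }

    // Adapts the stock line-oriented reader to the combined-split protocol.
    public static class TextWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
        public TextWrapper(CombineFileSplit split, TaskAttemptContext context,
                Integer idx) throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, idx);
        }
    }
}

A job would opt in with job.setInputFormatClass(SmallFileCombineInputFormat.class); the gain comes from scheduling one map task per combined split rather than one per small file.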
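The HDFS-level module (file merging, index mapping, cache) is described only at a high level, so the sketch below illustrates just the merge-and-index idea using the plain Hadoop FileSystem API. All names are hypothetical, and the thesis's cache layer is omitted.

import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: concatenate the small files under a directory into one HDFS file
// and keep an index of name -> (offset, length) for later random reads.
// One merged file means one block-level entry on the NameNode instead of
// one entry per small file, which is where the memory saving comes from.
public class SmallFileMerger {
    // File name -> {offset, length} inside the merged file.
    private final Map<String, long[]> index = new HashMap<>();

    public void merge(FileSystem fs, Path inputDir, Path merged)
            throws IOException {
        long offset = 0;
        try (FSDataOutputStream out = fs.create(merged)) {
            for (FileStatus stat : fs.listStatus(inputDir)) {
                if (stat.isDirectory()) {
                    continue;
                }
                try (FSDataInputStream in = fs.open(stat.getPath())) {
                    // Copy this small file's bytes onto the end of the merged file.
                    IOUtils.copyBytes(in, out, 4096, false);
                }
                index.put(stat.getPath().getName(),
                        new long[] { offset, stat.getLen() });
                offset += stat.getLen();
            }
        }
    }

    public byte[] read(FileSystem fs, Path merged, String name)
            throws IOException {
        long[] entry = index.get(name);
        if (entry == null) {
            throw new FileNotFoundException(name + " not in index");
        }
        byte[] buf = new byte[(int) entry[1]];
        try (FSDataInputStream in = fs.open(merged)) {
            // Positioned read at the recorded offset, no full scan needed.
            in.readFully(entry[0], buf);
        }
        return buf;
    }
}

In a real deployment the index would be persisted alongside the merged file and fronted by a cache, as the thesis's module description suggests; this sketch keeps it in memory for brevity.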
Keywords/Search Tags: Hadoop, Small File, MapReduce, HDFS, Independent Module