The Research And Analysis Of Hadoop Small File Processing Method

Posted on:2016-07-02

Degree:Master

Type:Thesis

Country:China

Candidate:S M Li

Full Text:PDF

GTID:2308330461991802

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, major companies store vast amounts of raw data through Internet or intranet systems. Much of this data consists of massive numbers of small files, such as system log files, static web pages, and user avatars. Hadoop can be used to store and process such big data. It is an open-source distributed infrastructure consisting mainly of the distributed file system HDFS and the distributed computing framework MapReduce. Hadoop is good at processing massive amounts of data, but its single-NameNode design, which simplifies the system, makes its performance on massive numbers of small files unsatisfactory.

This thesis studies the Hadoop small-file problem in depth and summarizes the directions for small-file optimization. Building on existing research, it optimizes the small-file merging algorithm, the reading algorithm, and the input format, with good results. The main work is as follows:

(1) Based on a literature review, the thesis identifies two directions for optimizing the small-file problem: file merging, and input formats related to data splits. File merging addresses the storage of massive numbers of small files by reducing the memory they occupy on the NameNode. An optimized input format reduces task processing time and so improves data processing efficiency.

(2) Analysis of experimental data shows that small-file merging is an effective optimization direction. Experiments also show that CombineFileInputFormat is the best of the four common small-file processing methods: it reduces the processing time of MapReduce jobs and improves data processing efficiency.

(3) A conventional merging algorithm can degenerate into an ordinary merge in special scenarios, which hurts file read performance. To prevent this, the thesis presents N-Combiner, a merging algorithm governed by a parameter N that takes into account both the impact of the number of files on storage performance and the read performance of the files. With N = 90, it outperforms the other merging algorithms.

(4) To improve the read efficiency of data processed by N-Combiner, the thesis proposes the Prefetching-Read algorithm, which is based on an improved FIFO algorithm. It improves read performance through a caching strategy while also taking into account the impact of cache size on system performance. Experimental results show that it reads data sequentially more efficiently than a strategy without prefetching and also outperforms a full-cache strategy.

(5) Different input formats use different splitting strategies. The strategy determines how much data each input split contains, which determines the number of Map tasks in a job and ultimately leads to very different job processing times. CombineFileInputFormat performs relatively well, so the thesis tunes its parameters and combines it with the N-Combiner algorithm to propose the PCFIF input format. PCFIF uses the same splitting strategy as CombineFileInputFormat, but some of its parameters are optimized and its data is preprocessed by N-Combiner, which makes it more efficient than CombineFileInputFormat at data processing.

In summary, the thesis addresses the small-file problem at the merging, reading, and input-format levels, and it can serve as a reference for further work on small-file optimization.
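
The merging idea in item (3) can be illustrated with a minimal Python sketch. It is not the thesis's N-Combiner, whose details the abstract does not give. It assumes that N acts as a cap on the number of small files packed into one merged container, and it uses in-memory byte strings in place of HDFS files. The names `merge_small_files` and `read_file` and the index layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical index entry: (container number, byte offset, length).
Index = Dict[str, Tuple[int, int, int]]


def merge_small_files(files: Dict[str, bytes], n: int) -> Tuple[List[bytes], Index]:
    """Pack small files into containers holding at most n files each.

    Assumption: n is a per-container file cap. The abstract does not say
    how the thesis's parameter N is actually defined.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    containers: List[bytearray] = []
    index: Index = {}
    for position, (name, data) in enumerate(files.items()):
        if position % n == 0:
            containers.append(bytearray())  # start a new container every n files
        container_id = len(containers) - 1
        offset = len(containers[container_id])
        containers[container_id].extend(data)
        index[name] = (container_id, offset, len(data))
    return [bytes(c) for c in containers], index


def read_file(containers: List[bytes], index: Index, name: str) -> bytes:
    """Read one original file back using its index entry."""
    container_id, offset, length = index[name]
    return containers[container_id][offset:offset + length]


# 200 tiny in-memory "files" stand in for logs, avatars, and similar data.
small_files = {f"log-{i:03d}.txt": f"entry {i}\n".encode() for i in range(200)}
containers, index = merge_small_files(small_files, n=90)
```

With 200 inputs and `n=90`, the storage layer tracks 3 merged containers instead of 200 separate files, which is the kind of reduction in NameNode metadata that merging aims for. The index keeps each original file individually readable, and the cap bounds how much data shares a container with any one file.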

Keywords/Search Tags:

Hadoop, small files processing, Hadoop Distributed File System (HDFS), MapReduce, big data

Related items

1	Processing Of Small Files Based On HDFS And Optimization And Improvement Of The Performance For Mapreduce Computing Model
2	Research And Optimization Of Hadoop Small File Processing Technology
3	Research And Implementation Of Small File Processing Techniques In Hadoop
4	Optimization Study On Storing Massive Small Files Based On Hadoop
5	Optimization Scheme Of Small File Processing Based On HDFS
6	Design And Implementation Of Small File Processing And Algorithm Parallelization Based On Hadoop
7	Research And Optimization Of Processing Performance Of The Numerous Small Files Based On Hadoop
8	Research And Optimization Of Small Files Processing Techniques In Hadoop
9	Research On Access Strategy Of Massive Small Files Based On Hadoop
10	A Strategy To Deal With Massive Small Files In Hadoop Distributed File Systems