
Processing of Small Files Based on HDFS and Optimization and Improvement of the Performance of the MapReduce Computing Model

Posted on: 2013-03-18  Degree: Master  Type: Thesis
Country: China  Candidate: R C Cai  Full Text: PDF
GTID: 2248330371983043  Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet and the explosive growth of data, traditional technical architectures have become increasingly unable to meet the demands of today's massive data volumes, and research on massive data processing and storage has surged. Doug Cutting, drawing on Google's papers, developed a new distributed computing platform, Hadoop, originally to build the index of a search engine. Hadoop is designed for streaming access to large files, but as its adoption has spread it is now used for many kinds of computation in many fields, and the requirements placed on it have grown accordingly. Handling small files has become one of Hadoop's bottlenecks.

Based on a study of existing solutions, this paper proposes a new method for handling small files in Hadoop. Here a "small file" is a file smaller than the HDFS block size (64 MB). Such files hurt the performance and scalability of Hadoop for the following reasons. First, every block, file, and directory is stored as an object in the NameNode's memory, and each object occupies roughly 150 bytes; 10 million small files therefore require about 2 GB of NameNode memory, and 100 million files about 20 GB. Small files thus consume a large share of the NameNode's memory, and the NameNode's memory capacity severely limits the expansion and application of the cluster. Second, reading a large number of small files is far slower than reading a few large files of the same total size: HDFS was originally developed for streaming access to large files, and accessing many small files forces the client to jump constantly from one DataNode to another, which seriously degrades performance. Last, MapReduce processes a large number of small files more slowly than it processes the same volume of data in large files, because each small file occupies its own map slot and starting a task takes considerable time, so most of the running time is spent starting and releasing tasks.

To solve the storage and management of small files, this paper builds a new top-level file system on HDFS named HSF. In HSF, files are classified and different kinds of files are handled in different ways. Small files such as image files are merged into a SequenceFile, which serves as their container, and an efficient index mechanism is built so that the original small files can still be accessed randomly (a minimal sketch of this merging pattern is given after the abstract); in this way the small-file problem in Hadoop is resolved.

In the experimental part of the thesis, we test the HSF file system with different kinds of data and different experimental cases. In the experiments on reading small files versus the merged file, binary image files and text files are tested separately; reading efficiency grows linearly for both the local file system and HDFS, so increasing the amount of data does not affect the normal operation of the system. We also compare the small files and the merged file by running the WordCount MapReduce sample program on text files; this experiment verifies that the HSF file system is well suited to the MapReduce computation model. Finally, an experiment on random reads of small files verifies that random file access in HSF is more efficient than in the HAR file system.
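The abstract does not include the merging code itself; the following is a minimal sketch of the SequenceFile-container pattern it describes, written against the Hadoop 1.x-era API that matches the 64 MB default block size mentioned above. The class name SmallFileMerger, the method merge, and the choice of the original file name as the record key are illustrative assumptions, not details taken from HSF.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Illustrative sketch: merges a local directory of small files into one
 * HDFS SequenceFile.  Each record is (original file name, raw file bytes),
 * so the file name can later be used to locate a file inside the container.
 */
public class SmallFileMerger {

    public static void merge(String localDir, String hdfsTarget) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path(hdfsTarget);

        File[] entries = new File(localDir).listFiles();
        if (entries == null) {
            throw new IOException("Not a readable directory: " + localDir);
        }

        SequenceFile.Writer writer = null;
        try {
            // Hadoop 1.x-style factory method; newer releases prefer
            // SequenceFile.createWriter(conf, Writer.Option...)
            writer = SequenceFile.createWriter(fs, conf, target,
                                               Text.class, BytesWritable.class);
            for (File f : entries) {
                if (!f.isFile()) {
                    continue;                          // skip sub-directories
                }
                byte[] bytes = Files.readAllBytes(f.toPath());
                // one (file name, file content) record per small file
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. merge("/tmp/images", "/user/test/images.seq")
        merge(args[0], args[1]);
    }
}

To read one specific small file back, a reader must either scan the SequenceFile sequentially or consult an index that maps each file name to the byte offset of its record; building such an index efficiently is the role of the index mechanism that HSF adds on top of the container.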
Keywords/Search Tags: Hadoop, HDFS, small files, SequenceFile, Distributed