
Processing of Small Files Based on HDFS and Optimization and Improvement of the Performance of the MapReduce Computing Model

Posted on: 2013-03-18  Degree: Master  Type: Thesis
Country: China  Candidate: R C Cai  Full Text: PDF
GTID: 2248330371983043  Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet and the explosive growth of data, traditional technical architectures have become increasingly unable to meet the demands of today's massive data volumes, and research on massive data processing and storage has surged. Doug Cutting, drawing on Google's papers, developed a new distributed computing platform, Hadoop, originally to build the index of a search engine. Hadoop is designed for streaming access to large files, but as its adoption has spread it is now used for many kinds of computation in many fields, and the requirements placed on it have grown accordingly. Handling small files has become one of Hadoop's bottlenecks.

Based on a study of existing solutions, this paper proposes a new method for handling small files in Hadoop. Here a "small file" is a file smaller than the HDFS block size (64 MB). Such files hurt the performance and scalability of Hadoop for the following reasons. First, every block, file, and directory is stored as an object in the NameNode's memory, and each object occupies roughly 150 bytes; 10 million small files therefore require about 2 GB of NameNode memory, and 100 million files about 20 GB. Small files thus consume a large share of the NameNode's memory, and the NameNode's memory capacity severely limits the expansion and application of the cluster. Second, reading a large number of small files is far slower than reading a few large files of the same total size: HDFS was originally developed for streaming access to large files, and accessing many small files forces the client to jump constantly from one DataNode to another, which seriously degrades performance. Last, MapReduce processes a large number of small files more slowly than it processes the same volume of data in large files, because each small file occupies its own map slot and starting a task takes considerable time, so most of the running time is spent starting and releasing tasks.

To solve the storage and management of small files, this paper builds a new top-level file system on HDFS named HSF. In HSF, files are classified and different kinds of files are handled in different ways. Small files such as image files are merged into a SequenceFile, which serves as their container, and an efficient index mechanism is built so that the original small files can still be accessed randomly (a minimal sketch of this merging pattern is given after the abstract); in this way the small-file problem in Hadoop is resolved.

In the experimental part of the thesis, we test the HSF file system with different kinds of data and different experimental cases. In the experiments on reading small files versus the merged file, binary image files and text files are tested separately; reading efficiency grows linearly for both the local file system and HDFS, so increasing the amount of data does not affect the normal operation of the system. We also compare the small files and the merged file by running the WordCount MapReduce sample program on text files; this experiment verifies that the HSF file system is well suited to the MapReduce computation model. Finally, an experiment on random reads of small files verifies that random file access in HSF is more efficient than in the HAR file system.
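The abstract does not include the merging code itself; the following is a minimal sketch of the SequenceFile-container pattern it describes, written against the Hadoop 1.x-era API that matches the 64 MB default block size mentioned above. The class name SmallFileMerger, the method merge, and the choice of the original file name as the record key are illustrative assumptions, not details taken from HSF.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Illustrative sketch: merges a local directory of small files into one
 * HDFS SequenceFile.  Each record is (original file name, raw file bytes),
 * so the file name can later be used to locate a file inside the container.
 */
public class SmallFileMerger {

    public static void merge(String localDir, String hdfsTarget) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path(hdfsTarget);

        File[] entries = new File(localDir).listFiles();
        if (entries == null) {
            throw new IOException("Not a readable directory: " + localDir);
        }

        SequenceFile.Writer writer = null;
        try {
            // Hadoop 1.x-style factory method; newer releases prefer
            // SequenceFile.createWriter(conf, Writer.Option...)
            writer = SequenceFile.createWriter(fs, conf, target,
                                               Text.class, BytesWritable.class);
            for (File f : entries) {
                if (!f.isFile()) {
                    continue;                          // skip sub-directories
                }
                byte[] bytes = Files.readAllBytes(f.toPath());
                // one (file name, file content) record per small file
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. merge("/tmp/images", "/user/test/images.seq")
        merge(args[0], args[1]);
    }
}

To read one specific small file back, a reader must either scan the SequenceFile sequentially or consult an index that maps each file name to the byte offset of its record; building such an index efficiently is the role of the index mechanism that HSF adds on top of the container.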
Keywords/Search Tags: Hadoop, HDFS, small files, SequenceFile, Distributed