
Research And Optimization Of Processing Performance Of The Numerous Small Files Based On Hadoop

Posted on: 2015-04-25    Degree: Master    Type: Thesis
Country: China    Candidate: Y L Zhao    Full Text: PDF
GTID: 2298330422990288    Subject: Computer technology
Abstract/Summary:
With the coming of the big data era, Internet data is growing with each passing day. Hadoop, a cloud computing platform, came into being to deal with massive amounts of data, because traditional data processing methods can no longer meet the requirements of large-scale data. Recently, Hadoop has been applied to data analysis, data mining, machine learning, and other scientific fields.

However, the Hadoop platform was designed from the outset for streaming access to large files. In practice, existing systems in fields such as energy, climate, astronomy, electronic commerce, and digital libraries contain huge numbers of small files. According to a 2007 survey, the National Energy Research Scientific Computing Center held more than 13 million files; 99% of them were smaller than 64 MB and 43% were smaller than 64 KB. Because Hadoop uses a single node to manage the metadata of its distributed file system, massive numbers of small files place enormous memory pressure on the master node and severely degrade the performance of the MapReduce model.

To address the fact that the current Hadoop platform cannot handle small files efficiently, a method based on least-squares curve fitting is proposed to determine “how small is small”. First, a criterion for quantifying the access time of a small file is defined; next, the access time is used as the deciding factor for what counts as a small file; finally, a threshold is derived by linear fitting of the access times measured on different datasets. In addition, this thesis proposes a file merge algorithm based on preprocessing to improve the processing performance of massive small files. Both the merged datasets and the original datasets are then processed with a WordCount program that incorporates the file merge algorithm, and the processing time of each dataset is recorded before and after merging. The experimental results show that processing the merged data is faster than processing the original data, which greatly improves the efficiency of MapReduce. A small-file processing algorithm named CIFSF is also studied and implemented; its principle is to combine small files into a single input split and let one map task handle that split. Finally, the recorded processing times show that CIFSF is also an efficient algorithm for handling massive small files.
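The abstract does not give the fitting formulas or the measured data; purely as an illustration, an ordinary least-squares line fit of access time against file size, with a hypothetical rule for reading off a small-file threshold, might look like the following Java sketch (all sample values are assumptions, not the thesis's measurements):

// Minimal ordinary least-squares fit of access time (ms) against file size (KB).
// The sample data and the threshold rule below are illustrative assumptions only.
public class AccessTimeFit {
    public static void main(String[] args) {
        double[] sizeKb = { 4, 16, 64, 256, 1024, 4096, 16384, 65536 }; // hypothetical file sizes
        double[] timeMs = { 9, 10, 11, 14, 22, 55, 180, 700 };          // hypothetical access times

        int n = sizeKb.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += sizeKb[i];
            sumY += timeMs[i];
            sumXY += sizeKb[i] * timeMs[i];
            sumXX += sizeKb[i] * sizeKb[i];
        }
        // Slope a and intercept b of the fitted line t = a * s + b.
        double a = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double b = (sumY - a * sumX) / n;
        System.out.printf("fitted line: t = %.6f * s + %.3f%n", a, b);

        // Hypothetical threshold rule: a file counts as "small" while its access time
        // is dominated by the per-file overhead b rather than the size term a * s.
        double thresholdKb = b / a;
        System.out.printf("illustrative small-file threshold: %.1f KB%n", thresholdKb);
    }
}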
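The preprocessing-based merge algorithm itself is not detailed in the abstract; a common way to merge massive small HDFS files before running WordCount is to pack them into a single Hadoop SequenceFile keyed by file name, sketched below as a generic illustration (the input and output paths are hypothetical, and this is not claimed to be the thesis's exact algorithm):

// Illustrative pre-merge step: pack every small file under an input directory into one
// SequenceFile (file name -> file bytes), so MapReduce reads one large file instead of
// thousands of tiny ones. Generic sketch, not the thesis's merge algorithm.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
    public static void main(String[] args) throws IOException {
        Path inputDir = new Path(args[0]);   // e.g. a directory of small files (hypothetical)
        Path mergedFile = new Path(args[1]); // e.g. the merged .seq output (hypothetical)

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    byte[] content = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, content, 0, content.length);
                    }
                    // Key = original file name, value = raw file contents.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(content));
                }
            }
        }
    }
}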
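CIFSF is described only as combining small files into one split that a single map task then handles; a comparable effect can be obtained in stock Hadoop with CombineTextInputFormat, as in the WordCount-style sketch below (the 128 MB split cap is an assumed value, and this stands in for, rather than reproduces, the thesis's CIFSF implementation):

// Sketch of a WordCount job whose input format packs many small files into each
// input split, so one map task handles a combined split -- the same idea the
// abstract attributes to CIFSF, here realized with stock CombineTextInputFormat.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedWordCount {

    // Standard WordCount mapper: emits (token, 1) for every whitespace-separated token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Standard WordCount reducer: sums the counts for each token.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount-combined-splits");
        job.setJarByClass(CombinedWordCount.class);

        // Pack many small files into each input split so that one map task
        // processes a combined split instead of a single tiny file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // assumed 128 MB cap

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}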
Keywords/Search Tags: Hadoop, Small file, MapReduce, Linear fitting, File merge, CIFSF