
Research And Optimization Of Processing Performance Of The Numerous Small Files Based On Hadoop

Posted on: 2015-04-25    Degree: Master    Type: Thesis
Country: China    Candidate: Y L Zhao    Full Text: PDF
GTID: 2298330422990288    Subject: Computer technology
Abstract/Summary:
With the coming of the big data era, Internet data is growing with each passing day. Hadoop, a cloud computing platform, came into being to deal with massive amounts of data, because traditional data processing methods can no longer meet the requirements of large-scale data. Recently, Hadoop has been applied to data analysis, data mining, machine learning, and other scientific fields.

However, the Hadoop platform was designed from the outset for streaming access to large files. In practice, existing systems in fields such as energy, climate, astronomy, electronic commerce, and digital libraries contain huge numbers of small files. According to a 2007 survey, the National Energy Research Scientific Computing Center held more than 13 million files; 99% of them were smaller than 64 MB and 43% were smaller than 64 KB. Because Hadoop uses a single node to manage the metadata of its distributed file system, massive numbers of small files place enormous memory pressure on the master node and severely degrade the performance of the MapReduce model.

To address the fact that the current Hadoop platform cannot handle small files efficiently, a method based on least-squares curve fitting is proposed to determine “how small is small”. First, a criterion for quantifying the access time of a small file is defined; next, the access time is used as the deciding factor for what counts as a small file; finally, a threshold is derived by linear fitting of the access times measured on different datasets. In addition, this thesis proposes a file merge algorithm based on preprocessing to improve the processing performance of massive small files. Both the merged datasets and the original datasets are then processed with a WordCount program that incorporates the file merge algorithm, and the processing time of each dataset is recorded before and after merging. The experimental results show that processing the merged data is faster than processing the original data, which greatly improves the efficiency of MapReduce. A small-file processing algorithm named CIFSF is also studied and implemented; its principle is to combine small files into a single input split and let one map task handle that split. Finally, the recorded processing times show that CIFSF is also an efficient algorithm for handling massive small files.
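The abstract does not give the fitting formulas or the measured data; purely as an illustration, an ordinary least-squares line fit of access time against file size, with a hypothetical rule for reading off a small-file threshold, might look like the following Java sketch (all sample values are assumptions, not the thesis's measurements):

// Minimal ordinary least-squares fit of access time (ms) against file size (KB).
// The sample data and the threshold rule below are illustrative assumptions only.
public class AccessTimeFit {
    public static void main(String[] args) {
        double[] sizeKb = { 4, 16, 64, 256, 1024, 4096, 16384, 65536 }; // hypothetical file sizes
        double[] timeMs = { 9, 10, 11, 14, 22, 55, 180, 700 };          // hypothetical access times

        int n = sizeKb.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += sizeKb[i];
            sumY += timeMs[i];
            sumXY += sizeKb[i] * timeMs[i];
            sumXX += sizeKb[i] * sizeKb[i];
        }
        // Slope a and intercept b of the fitted line t = a * s + b.
        double a = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double b = (sumY - a * sumX) / n;
        System.out.printf("fitted line: t = %.6f * s + %.3f%n", a, b);

        // Hypothetical threshold rule: a file counts as "small" while its access time
        // is dominated by the per-file overhead b rather than the size term a * s.
        double thresholdKb = b / a;
        System.out.printf("illustrative small-file threshold: %.1f KB%n", thresholdKb);
    }
}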
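The preprocessing-based merge algorithm itself is not detailed in the abstract; a common way to merge massive small HDFS files before running WordCount is to pack them into a single Hadoop SequenceFile keyed by file name, sketched below as a generic illustration (the input and output paths are hypothetical, and this is not claimed to be the thesis's exact algorithm):

// Illustrative pre-merge step: pack every small file under an input directory into one
// SequenceFile (file name -> file bytes), so MapReduce reads one large file instead of
// thousands of tiny ones. Generic sketch, not the thesis's merge algorithm.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
    public static void main(String[] args) throws IOException {
        Path inputDir = new Path(args[0]);   // e.g. a directory of small files (hypothetical)
        Path mergedFile = new Path(args[1]); // e.g. the merged .seq output (hypothetical)

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    byte[] content = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, content, 0, content.length);
                    }
                    // Key = original file name, value = raw file contents.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(content));
                }
            }
        }
    }
}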
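CIFSF is described only as combining small files into one split that a single map task then handles; a comparable effect can be obtained in stock Hadoop with CombineTextInputFormat, as in the WordCount-style sketch below (the 128 MB split cap is an assumed value, and this stands in for, rather than reproduces, the thesis's CIFSF implementation):

// Sketch of a WordCount job whose input format packs many small files into each
// input split, so one map task handles a combined split -- the same idea the
// abstract attributes to CIFSF, here realized with stock CombineTextInputFormat.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedWordCount {

    // Standard WordCount mapper: emits (token, 1) for every whitespace-separated token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Standard WordCount reducer: sums the counts for each token.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount-combined-splits");
        job.setJarByClass(CombinedWordCount.class);

        // Pack many small files into each input split so that one map task
        // processes a combined split instead of a single tiny file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // assumed 128 MB cap

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}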
Keywords/Search Tags: Hadoop, Small file, MapReduce, Linear fitting, File merge, CIFSF