As a mature distributed system, Hadoop provides powerful storage and analysis capabilities for massive data and performs well in many applications. HDFS (Hadoop Distributed File System), the subject of this paper, is an essential component of Hadoop. It is designed to store large files with streaming data access patterns, which suits the analysis of large datasets. However, many scenarios also require handling large numbers of small files, and HDFS does not cope with them well. Its flaws become apparent as the number of small files grows significantly: the namenode's memory is consumed rapidly and becomes the bottleneck of the whole system, the efficiency of accessing small files drops sharply, and MapReduce resources are wasted. Analysis shows that the key to these problems is to reduce both the total number of files and the number of communications between clients and the namenode during file access. Following this approach, small files are merged into large ones, which are stored in the system as a whole, so that the file metadata maintained by the namenode shrinks and the memory pressure is alleviated. Then, exploiting the fast lookup and ordered storage of the B+ tree, a B+ tree based index is built to record the mapping between small files and merged files. The B+ tree is also modified so that, when a client accesses a file, it obtains the index information of files related by path and by upload time; this reduces how often the index must be read, since there is no need to fetch the index from the namenode when it is already available locally. Furthermore, the data block containing the requested file is prefetched, and an index is built to maintain the relevant information, saving communications when accessing files in the same block. Finally, a Hadoop cluster is built, the proposed method is deployed on it, and memory usage and access efficiency are measured. The experiments show that, compared with the original HDFS, the proposed method noticeably improves memory usage and file access efficiency.
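To make the merge-and-index idea concrete, the following is a minimal Java sketch, not the paper's actual implementation: it appends small files into one merged HDFS file and records each file's (offset, length) in a sorted in-memory map, where Java's TreeMap merely stands in for the paper's B+ tree index. The class SmallFileMerger and its methods are hypothetical names introduced for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.IOException;
import java.util.List;
import java.util.TreeMap;

public class SmallFileMerger {

    /** Index entry: where a small file lives inside the merged file. */
    static class Extent {
        final long offset;
        final long length;
        Extent(long offset, long length) { this.offset = offset; this.length = length; }
    }

    // Sorted map keyed by the small file's original path. The paper uses a B+ tree;
    // a TreeMap is used here only to keep the sketch self-contained.
    private final TreeMap<String, Extent> index = new TreeMap<>();
    private final FileSystem fs;

    SmallFileMerger(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
    }

    /** Append each small file to one merged HDFS file and record its extent. */
    void merge(List<Path> smallFiles, Path mergedFile) throws IOException {
        long offset = 0;
        try (FSDataOutputStream out = fs.create(mergedFile)) {
            for (Path p : smallFiles) {
                long len = fs.getFileStatus(p).getLen();
                try (FSDataInputStream in = fs.open(p)) {
                    IOUtils.copyBytes(in, out, 4096, false);
                }
                index.put(p.toString(), new Extent(offset, len));
                offset += len;
            }
        }
    }

    /** Read one small file back by seeking into the merged file. */
    byte[] read(String originalPath, Path mergedFile) throws IOException {
        Extent e = index.get(originalPath);
        byte[] buf = new byte[(int) e.length];
        try (FSDataInputStream in = fs.open(mergedFile)) {
            in.seek(e.offset);
            in.readFully(buf);
        }
        return buf;
    }
}
```

In this sketch the namenode only ever sees the merged file, which is the source of the memory savings; the per-small-file bookkeeping lives in the client-side index, and the paper's path- and upload-time-aware B+ tree plus block prefetching then reduce how often that index and the namenode must be consulted.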