Research And Design Of Massive Small Files Merging Based On Hadoop

Posted on: 2019-02-17    Degree: Master    Type: Thesis
Country: China    Candidate: J F Peng    Full Text: PDF
GTID: 2428330548985044    Subject: Electronic and communication engineering

Abstract/Summary:
The Hadoop Distributed File System (HDFS) is the flagship file system of Hadoop, designed to reliably store and manage large-scale files. Generally speaking, HDFS is an efficient way to store data at the terabyte (TB) or petabyte (PB) level. However, when HDFS stores a large number of small files, i.e., files significantly smaller than the HDFS block size, the Small Files Problem occurs and strains the namenode's main memory. The reason is that the namenode keeps all file metadata and block-list information in its main memory, so HDFS suffers a performance penalty as the number of small files grows, and storing and managing a mass of small files imposes a heavy burden on the namenode.

To improve the efficiency of storing and accessing small files on HDFS, we propose the Small Hadoop Distributed File System (SHDFS), which is based on the original HDFS. Compared with the original HDFS, SHDFS adds two novel modules: a merging module and a caching module. In the merging module, we propose a correlated-files model, which finds correlated files through user-based collaborative filtering and then merges them into a single large file to reduce the total number of files. In the caching module, we use a log-linear model to identify hot-spot data that users access frequently, and we design a dedicated memory subsystem to cache these data; the caching mechanism speeds up access to hot-spot data. This module aims to decrease the number of interactions between the HDFS client and the namenode, reduce the pressure on the namenode's main memory, and improve the efficiency of reading files.

The experimental results show that, when storing an equal number of small files, SHDFS consumes 15% less namenode main memory than the original HDFS. In hot mode, SHDFS reads files more efficiently than the original HDFS, for both random reads of a single file and sequential reads of multiple files. We further put forward an optimization scheme; compared with the original HDFS and SHDFS, the namenode's high memory consumption is markedly improved, which proves the feasibility and effectiveness of the small-file optimization proposed in this thesis.
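The thesis's own merging code is not reproduced here, but the core merge step, packing many small files into one container file keyed by their original names so that the namenode tracks a single entry, can be illustrated with Hadoop's standard SequenceFile API. The sketch below is a minimal illustration under stated assumptions, not the SHDFS implementation: it merges an entire directory rather than a collaboratively-filtered group of correlated files, and the class name (SmallFileMerger) and paths are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Illustrative sketch: merges every plain file in a directory into one
 *  SequenceFile, keyed by the original file name, so the namenode holds
 *  metadata for a single file instead of thousands of small ones. */
public class SmallFileMerger {
    public static void merge(Configuration conf, Path inputDir, Path mergedFile)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue; // merge plain files only
                }
                // Small files fit in an int-sized buffer by definition.
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // key = original file name, value = raw file bytes
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical paths for illustration only.
        merge(new Configuration(),
              new Path("/data/small-files"),
              new Path("/data/merged.seq"));
    }
}
```

After merging, a client locates a small file inside the container by its key, which adds a lookup step on reads; this is exactly the cost the caching module is intended to offset for frequently accessed files.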
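The caching module in the thesis ranks hot-spot data with a log-linear model. As a deliberately simpler stand-in for that model, the sketch below promotes a file into a size-bounded, LRU-evicting in-memory cache once its raw access count crosses a threshold. All names (HotSpotCache, hotThreshold) are illustrative assumptions, not the thesis's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch: a size-bounded, LRU-evicting cache for hot file
 *  contents. A file is promoted into the cache once its access count
 *  crosses a threshold; this access-count rule is a simple stand-in for
 *  the log-linear hot-spot model used in the thesis. */
public class HotSpotCache {
    private final int capacity;
    private final int hotThreshold;
    private final Map<String, Integer> accessCounts = new ConcurrentHashMap<>();
    private final LinkedHashMap<String, byte[]> cache;

    public HotSpotCache(int capacity, int hotThreshold) {
        this.capacity = capacity;
        this.hotThreshold = hotThreshold;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > HotSpotCache.this.capacity; // evict LRU entry
            }
        };
    }

    /** Records the access and returns cached bytes, or null on a miss
     *  (the caller then reads the file from HDFS as usual). */
    public synchronized byte[] get(String fileName) {
        accessCounts.merge(fileName, 1, Integer::sum);
        return cache.get(fileName);
    }

    /** Caches the file's contents only once it has proven itself hot. */
    public synchronized void maybeCache(String fileName, byte[] content) {
        if (accessCounts.getOrDefault(fileName, 0) >= hotThreshold) {
            cache.put(fileName, content);
        }
    }
}
```

Serving hot files from such a cache avoids repeated client-namenode round trips, which is the mechanism behind the reduced namenode pressure reported above.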
Keywords/Search Tags:HDFS, SHDFS, massive small files, merge, cache