Research And Design Of Massive Small Files Merging Based On Hadoop

Posted on: 2019-02-17    Degree: Master    Type: Thesis
Country: China    Candidate: J F Peng    Full Text: PDF
GTID: 2428330548985044    Subject: Electronic and communication engineering

Abstract/Summary:
The Hadoop Distributed File System (HDFS) is the flagship file system of Hadoop, designed to reliably store and manage large-scale files. Generally speaking, HDFS is an efficient way to store data at the terabyte (TB) or petabyte (PB) level. However, when HDFS stores a large number of small files, i.e., files significantly smaller than the HDFS block size, the Small Files Problem occurs and strains the namenode's main memory. The reason is that the namenode keeps all file metadata and block-list information in its main memory, so HDFS suffers a performance penalty as the number of small files grows, and storing and managing a mass of small files imposes a heavy burden on the namenode.

To improve the efficiency of storing and accessing small files on HDFS, we propose the Small Hadoop Distributed File System (SHDFS), which is based on the original HDFS. Compared with the original HDFS, SHDFS adds two novel modules: a merging module and a caching module. In the merging module, we propose a correlated-files model, which finds correlated files through user-based collaborative filtering and then merges them into a single large file to reduce the total number of files. In the caching module, we use a log-linear model to identify hot-spot data that users access frequently, and we design a dedicated memory subsystem to cache these data; the caching mechanism speeds up access to hot-spot data. This module aims to decrease the number of interactions between the HDFS client and the namenode, reduce the pressure on the namenode's main memory, and improve the efficiency of reading files.

The experimental results show that, when storing an equal number of small files, SHDFS consumes 15% less namenode main memory than the original HDFS. In hot mode, SHDFS reads files more efficiently than the original HDFS, for both random reads of a single file and sequential reads of multiple files. We further put forward an optimization scheme; compared with the original HDFS and SHDFS, the namenode's high memory consumption is markedly improved, which proves the feasibility and effectiveness of the small-file optimization proposed in this thesis.
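The thesis's own merging code is not reproduced here, but the core merge step, packing many small files into one container file keyed by their original names so that the namenode tracks a single entry, can be illustrated with Hadoop's standard SequenceFile API. The sketch below is a minimal illustration under stated assumptions, not the SHDFS implementation: it merges an entire directory rather than a collaboratively-filtered group of correlated files, and the class name (SmallFileMerger) and paths are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Illustrative sketch: merges every plain file in a directory into one
 *  SequenceFile, keyed by the original file name, so the namenode holds
 *  metadata for a single file instead of thousands of small ones. */
public class SmallFileMerger {
    public static void merge(Configuration conf, Path inputDir, Path mergedFile)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue; // merge plain files only
                }
                // Small files fit in an int-sized buffer by definition.
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // key = original file name, value = raw file bytes
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical paths for illustration only.
        merge(new Configuration(),
              new Path("/data/small-files"),
              new Path("/data/merged.seq"));
    }
}
```

After merging, a client locates a small file inside the container by its key, which adds a lookup step on reads; this is exactly the cost the caching module is intended to offset for frequently accessed files.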
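The caching module in the thesis ranks hot-spot data with a log-linear model. As a deliberately simpler stand-in for that model, the sketch below promotes a file into a size-bounded, LRU-evicting in-memory cache once its raw access count crosses a threshold. All names (HotSpotCache, hotThreshold) are illustrative assumptions, not the thesis's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch: a size-bounded, LRU-evicting cache for hot file
 *  contents. A file is promoted into the cache once its access count
 *  crosses a threshold; this access-count rule is a simple stand-in for
 *  the log-linear hot-spot model used in the thesis. */
public class HotSpotCache {
    private final int capacity;
    private final int hotThreshold;
    private final Map<String, Integer> accessCounts = new ConcurrentHashMap<>();
    private final LinkedHashMap<String, byte[]> cache;

    public HotSpotCache(int capacity, int hotThreshold) {
        this.capacity = capacity;
        this.hotThreshold = hotThreshold;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > HotSpotCache.this.capacity; // evict LRU entry
            }
        };
    }

    /** Records the access and returns cached bytes, or null on a miss
     *  (the caller then reads the file from HDFS as usual). */
    public synchronized byte[] get(String fileName) {
        accessCounts.merge(fileName, 1, Integer::sum);
        return cache.get(fileName);
    }

    /** Caches the file's contents only once it has proven itself hot. */
    public synchronized void maybeCache(String fileName, byte[] content) {
        if (accessCounts.getOrDefault(fileName, 0) >= hotThreshold) {
            cache.put(fileName, content);
        }
    }
}
```

Serving hot files from such a cache avoids repeated client-namenode round trips, which is the mechanism behind the reduced namenode pressure reported above.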
Keywords/Search Tags:HDFS, SHDFS, massive small files, merge, cache