
Research On Small File Storage Mechanism For Hadoop

Posted on: 2019-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: K Wang
Full Text: PDF
GTID: 2348330545958453
Subject: Computer Science and Technology
Abstract/Summary:
Hadoop is a popular big data processing platform. Owing to its high scalability, high reliability and other advantages, it is widely used in industry. Its core component, the Hadoop Distributed File System (HDFS), stores large files efficiently. However, when HDFS stores a large number of files that are much smaller than the block size, its access characteristics and metadata management model give rise to the small file problem: (1) the client must make frequent jumps between DataNodes to access small files, so file read and write performance is poor; (2) while the cluster is running, the NameNode keeps all metadata in memory, and its limited memory cannot manage the metadata of massive numbers of small files; (3) when the NameNode starts, loading the massive metadata takes a long time, during which the cluster is unavailable. As the number of images, logs and other small files on the Internet keeps growing, Hadoop frequently has to store or process large numbers of small files, and how to solve the Hadoop small file problem has received continuous attention from both academia and industry.

Existing work mainly aggregates small files into large files, reducing the number of files and the amount of metadata so as to relieve the memory pressure on the NameNode and store large numbers of small files. However, existing schemes still suffer from small files straddling block boundaries and wasted block space, and because they sacrifice per-file metadata, HDFS can no longer directly perform directory commands, access control and other file system management operations. This thesis therefore studies small file access and small file metadata management to address the small file problem on the Hadoop platform.

To address the poor read and write performance of massive small files, this thesis proposes a multi-level optimized storage method based on file merging and prefetch caching. The method first uses a balanced merge queue algorithm to merge small files into large files of block size, making full use of block space, preventing a small file from straddling two blocks, reducing node jumps when storing files and improving write speed. It then uses a Bloom filter and a file mapping index, combined with a prefetch caching strategy, to shorten the query and transmission time of file reads and improve read speed. Simulation experiments show that the method provides efficient small file read and write performance.

To address the difficulty the NameNode has in managing massive small file metadata, this thesis proposes a metadata management method based on a log-structured merge tree and a flattened directory, which makes up for the deficiencies of existing schemes in small file metadata management. First, a metadata storage component based on a log-structured merge tree and memory-mapped files is designed, moving the NameNode's metadata from memory to disk and increasing the scale of metadata it can manage. Second, directories and metadata are flattened, so the NameNode no longer needs to rebuild a tree-structured directory, shortening the time to load and access metadata. Simulation results show that the method achieves excellent metadata operation performance and manages several times as much metadata as the original HDFS.
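To make the balanced merge idea concrete, the sketch below shows one plausible way to pack small files into merged files no larger than a single HDFS block. The class name, the 128 MB block size and the worst-fit packing heuristic are assumptions for illustration, not the thesis's actual algorithm.

```java
import java.util.*;

// A hypothetical balanced-merge planner: small files are packed into merge
// buckets no larger than one HDFS block, and each new file goes to the
// least-full open bucket so that blocks fill evenly and no small file has
// to straddle a block boundary. All names and sizes are illustrative.
public class BalancedMergePlanner {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;   // assumed HDFS block size

    // One bucket becomes one merged file of at most BLOCK_SIZE bytes.
    static class Bucket {
        long used = 0;
        List<String> files = new ArrayList<>();
    }

    public static List<Bucket> plan(Map<String, Long> smallFiles) {
        // Place larger files first so the remaining gaps are filled by smaller ones.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(smallFiles.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        // The queue keeps the emptiest (most spacious) open bucket on top.
        PriorityQueue<Bucket> open =
            new PriorityQueue<>(Comparator.comparingLong((Bucket b) -> b.used));

        for (Map.Entry<String, Long> f : sorted) {
            Bucket target = open.peek();
            if (target == null || target.used + f.getValue() > BLOCK_SIZE) {
                target = new Bucket();            // no open bucket fits: start a new merged file
            } else {
                open.poll();                      // reuse the emptiest open bucket
            }
            target.files.add(f.getKey());
            target.used += f.getValue();
            open.add(target);
        }
        return new ArrayList<>(open);             // each bucket is written out as one merged file
    }
}
```

Because the emptiest bucket always has the most remaining space, a new merged file is started only when no open bucket can hold the next small file, which keeps block space usage high without letting any merged file exceed one block.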
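The read path described in the abstract, combining a Bloom filter, a file mapping index and a prefetch cache, could be organized roughly as follows. The toy Bloom filter, the cache sizes and all names here are hypothetical, not the thesis's implementation.

```java
import java.util.*;

// A hypothetical small-file read path: check the cache, then a Bloom filter,
// then a mapping index, and finally do one ranged read from the merged file,
// prefetching co-merged files into an LRU cache afterwards.
public class SmallFileReader {
    // (mergedFile, offset, length) locates one small file inside a merged file.
    static class Location {
        final String mergedFile; final long offset; final int length;
        Location(String m, long o, int l) { mergedFile = m; offset = o; length = l; }
    }

    private static final int BLOOM_BITS = 1 << 20;
    private final BitSet bloom = new BitSet(BLOOM_BITS);           // toy Bloom filter
    private final Map<String, Location> index = new HashMap<>();   // small file -> location
    // Access-ordered LRU cache of recently read or prefetched file contents.
    private final Map<String, byte[]> cache =
        new LinkedHashMap<String, byte[]>(256, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) { return size() > 1024; }
        };

    private int[] hashes(String name) {
        int h1 = name.hashCode(), h2 = Integer.reverse(h1) ^ 0x9E3779B9;
        return new int[] { (h1 & 0x7fffffff) % BLOOM_BITS, (h2 & 0x7fffffff) % BLOOM_BITS };
    }

    public void addToIndex(String name, Location loc) {
        index.put(name, loc);
        for (int h : hashes(name)) bloom.set(h);
    }

    public byte[] read(String name) {
        byte[] hit = cache.get(name);
        if (hit != null) return hit;                               // served from the prefetch cache
        for (int h : hashes(name))
            if (!bloom.get(h)) return null;                        // definitely not stored
        Location loc = index.get(name);
        if (loc == null) return null;                              // Bloom filter false positive
        byte[] data = readSegment(loc);                            // one ranged read from the merged file
        cache.put(name, data);
        prefetch(loc);                                             // warm the cache with co-merged files
        return data;
    }

    // Prefetch other small files stored in the same merged file, assuming they
    // are likely to be requested together (a real system would bound this).
    private void prefetch(Location loc) {
        index.forEach((n, l) -> {
            if (l.mergedFile.equals(loc.mergedFile) && !cache.containsKey(n)) cache.put(n, readSegment(l));
        });
    }

    private byte[] readSegment(Location loc) {
        // Placeholder for a positioned read of loc.length bytes at loc.offset in HDFS.
        return new byte[loc.length];
    }
}
```

The Bloom filter rejects most requests for nonexistent files without touching the index, and prefetching files that were merged together shortens subsequent reads, which is the intuition behind the reported read-speed improvement.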
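Similarly, a minimal sketch of the flattened, log-structured metadata store: full paths serve as flat keys (no in-memory directory tree), a sorted in-memory memtable absorbs updates, and full memtables are flushed to on-disk segments that are read back through memory-mapped files. The file format, thresholds and names are assumptions.

```java
import java.io.*;
import java.nio.*;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// A hypothetical flattened, log-structured metadata store in the spirit of an
// LSM tree: writes go to a sorted memtable, full memtables are flushed to
// sorted segment files, and lookups read the segments via memory mapping.
public class FlatMetadataStore {
    private static final int MEMTABLE_LIMIT = 100_000;   // flush threshold (assumed)

    private final Path dir;
    private final TreeMap<String, String> memtable = new TreeMap<>();   // flat path -> metadata
    private final List<MappedByteBuffer> segments = new ArrayList<>();  // newest last

    public FlatMetadataStore(Path dir) throws IOException {
        this.dir = dir;
        Files.createDirectories(dir);
    }

    // Store metadata under the flattened key "/a/b/c.txt" instead of a tree node.
    public void put(String fullPath, String meta) throws IOException {
        memtable.put(fullPath, meta);
        if (memtable.size() >= MEMTABLE_LIMIT) flush();
    }

    public String get(String fullPath) {
        String v = memtable.get(fullPath);
        if (v != null) return v;
        // Search segments newest-first; a real store would binary-search sorted
        // records or keep per-segment Bloom filters, here we scan for brevity.
        for (int i = segments.size() - 1; i >= 0; i--) {
            String hit = scan(segments.get(i), fullPath);
            if (hit != null) return hit;
        }
        return null;
    }

    // Write the sorted memtable to a segment file and memory-map it for reads.
    private void flush() throws IOException {
        Path seg = dir.resolve("segment-" + segments.size() + ".log");
        StringBuilder sb = new StringBuilder();
        memtable.forEach((k, val) -> sb.append(k).append('\t').append(val).append('\n'));
        Files.writeString(seg, sb.toString());
        try (FileChannel ch = FileChannel.open(seg, StandardOpenOption.READ)) {
            segments.add(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
        memtable.clear();
    }

    private String scan(MappedByteBuffer buf, String key) {
        byte[] bytes = new byte[buf.capacity()];
        buf.duplicate().get(bytes);
        for (String line : new String(bytes, StandardCharsets.UTF_8).split("\n")) {
            int tab = line.indexOf('\t');
            if (tab > 0 && line.substring(0, tab).equals(key)) return line.substring(tab + 1);
        }
        return null;
    }
}
```

Because segments live on disk and are paged in on demand through the memory mapping, the amount of metadata that can be managed is no longer bounded by NameNode heap size, and the flat keys avoid rebuilding a tree-structured directory at startup, which matches the design goals stated in the abstract.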
Keywords/Search Tags: small file problem, Hadoop distributed file system, merge algorithm, storage optimization, metadata management