
Research On Mass Small File Storage Technology Based On Hadoop

Posted on: 2018-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: L Xie
Full Text: PDF
GTID: 2428330548480458
Subject: Engineering
Abstract/Summary:
The Hadoop Distributed File System (HDFS) is the "data warehouse" of Hadoop. It was originally designed to store large files, and it stores small files inefficiently. HDFS has a master/slave architecture, and the main reason for this inefficiency is that large numbers of small files seriously consume the master's resources, increase the master's workload, and increase network load. HDFS itself lacks I/O optimization strategies for small files, so the HDFS small-file storage problem has become one of the hot topics in the field of big data. Hadoop handles many kinds of files, and storage solutions for unstructured, irregular, general-purpose small files remain a promising research topic. This paper proposes a general merging approach for such files, and focuses on the design and implementation of an HDFS client local cache based on the LRFU cache replacement policy. Together these improve the access efficiency of small files. The main results are summarized as follows.

First, this paper applies FP-growth to the order of file accesses in the Web log to mine the associations between small files, and merges the associated files between trigger files into the same file; this is the first merge. The second merge, based on a uniform distribution, recombines the smaller files produced by the first merge and indexes the merged files.

Second, to address the heavy consumption of NameNode resources caused by storing large numbers of small files on HDFS, this paper proposes that the HDFS client take over some of the NameNode's functions. Based on this idea, a local cache of file block information is designed and implemented on the HDFS client. Users can obtain file block information locally when accessing files, without querying the NameNode, which reduces NameNode resource consumption. Experiments show that the hit rate of the HDFS client local cache can be maintained above 50%, small-file access speed increases by a factor of 3.1, and small-file metadata requests to the NameNode decrease by a factor of 28.
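The client-side cache described above can be illustrated with a minimal sketch of the LRFU replacement policy. This is not the thesis implementation: the class name `LRFUCache`, the `load` callback, the stand-in `fake_namenode_lookup`, the logical clock, and the value `lam=0.1` are all assumptions made for illustration. The sketch uses the standard LRFU idea of a combined recency and frequency (CRF) score with the decay function F(x) = (1/2)^(λx), so that λ near 0 behaves like LFU and λ = 1 behaves like LRU.

```python
class LRFUCache:
    """Toy LRFU cache mapping a file path to its block information.

    Illustrative only: names and parameters are assumptions, not the
    design used in the thesis.
    """

    def __init__(self, capacity, lam=0.1):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.lam = lam        # 0 -> LFU-like, 1 -> LRU-like
        self.clock = 0        # logical time: one tick per lookup
        self.entries = {}     # key -> [value, crf, last_access_tick]
        self.hits = 0
        self.misses = 0

    def _decay(self, age):
        # Weighting function F(x) = (1/2) ** (lam * x).
        return 0.5 ** (self.lam * age)

    def _crf_now(self, key):
        # CRF score of an entry, decayed to the current tick.
        _, crf, last = self.entries[key]
        return crf * self._decay(self.clock - last)

    def get(self, key, load):
        """Return the cached value, calling load(key) on a miss."""
        self.clock += 1
        entry = self.entries.get(key)
        if entry is not None:
            self.hits += 1
            # New reference contributes 1; the old score is decayed.
            entry[1] = 1.0 + entry[1] * self._decay(self.clock - entry[2])
            entry[2] = self.clock
            return entry[0]
        self.misses += 1
        value = load(key)
        if len(self.entries) >= self.capacity:
            # Evict the entry with the smallest current CRF score.
            victim = min(self.entries, key=self._crf_now)
            del self.entries[victim]
        self.entries[key] = [value, 1.0, self.clock]
        return value


namenode_calls = []

def fake_namenode_lookup(path):
    # Hypothetical stand-in for a metadata request to the NameNode.
    namenode_calls.append(path)
    return {"path": path, "blocks": ["blk_" + path.strip("/")]}

cache = LRFUCache(capacity=2, lam=0.1)
for path in ["/a", "/a", "/b", "/c", "/a"]:
    cache.get(path, fake_namenode_lookup)
```

In this access sequence the cache evicts `/b` rather than `/a` when `/c` arrives: `/a` is older but was referenced twice, so its decayed CRF score is higher. A pure LRU policy would have evicted `/a` and paid an extra lookup on the final access.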
Keywords/Search Tags:Hadoop, HDFS, Apriori Algorithm, Double Merger, Index Mechanism, LRFU