Research On Mass Small File Storage Technology Based On Hadoop

Posted on:2018-11-18

Degree:Master

Type:Thesis

Country:China

Candidate:L Xie

Full Text:PDF

GTID:2428330548480458

Subject:Engineering

Abstract/Summary:


Hadoop Distributed File System (HDFS) is the "data warehouse" of Hadoop. It was originally designed to store large files, and it stores small files inefficiently. HDFS has a master/slave architecture, and the main reason for this inefficiency is that large numbers of small files seriously consume the master's resources, increase its workload, and increase network load. HDFS itself lacks I/O optimization strategies for small files, so the HDFS small-file storage problem has become one of the hot topics in the field of big data. Hadoop handles many different kinds of files, and storage solutions for unstructured, irregular, general-purpose small files remain a promising research topic. This thesis proposes a general merging approach for such files and focuses on the design and implementation of an HDFS client local cache based on the LRFU cache replacement policy. This approach improves the access efficiency of small files. The main contributions are summarized as follows.

First, the FP-growth algorithm is applied to the order in which files appear in the Web log to mine the associations between small files, and each trigger file is merged with its associated files into the same file. This is the first merge. The second merge recombines the smaller files produced by the first merge according to a uniform distribution and builds an index over the merged files.

Second, to address the heavy NameNode resource consumption caused by storing large numbers of small files on HDFS, this thesis proposes that the HDFS client take over some of the NameNode's functions. Based on this idea, a local cache of file block information is designed and implemented on the HDFS client. Users can obtain file block information locally when accessing files, without querying the NameNode, which reduces NameNode resource consumption.

Experiments show that the hit rate of the HDFS client local cache can be maintained above 50%, small-file access speed increases by 3.1 times, and small-file metadata requests to the NameNode decrease by a factor of 28.
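
To make the LRFU replacement policy named in the abstract more concrete, the following is a minimal Python sketch of a generic LRFU cache, not the thesis's implementation. It assumes the standard LRFU formulation, in which each entry carries a combined recency and frequency (CRF) value that decays as (1/2)^(λ·Δt) and the entry with the smallest current CRF is evicted. The class name `LRFUCache`, the parameter `lam`, the logical clock, and the idea of mapping a file path to block-location data are all illustrative assumptions.

```python
class LRFUCache:
    """Illustrative LRFU cache (hypothetical; not the thesis's code).

    lam = 0 behaves like LFU (pure frequency);
    lam = 1 weights recent references most heavily, approaching LRU.
    """

    def __init__(self, capacity, lam=0.5):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        if not 0.0 <= lam <= 1.0:
            raise ValueError("lam must be in [0, 1]")
        self.capacity = capacity
        self.lam = lam
        self.clock = 0          # logical time: one tick per get/put
        self.entries = {}       # key -> [value, crf, last_reference_time]

    def _decay(self, elapsed):
        # Weighting function F(x) = (1/2)^(lam * x)
        return 0.5 ** (self.lam * elapsed)

    def _crf_now(self, key):
        # CRF of an entry as seen at the current logical time
        _, crf, last = self.entries[key]
        return crf * self._decay(self.clock - last)

    def _touch(self, entry):
        # New CRF = F(0) + F(elapsed) * old CRF, with F(0) = 1
        entry[1] = 1.0 + entry[1] * self._decay(self.clock - entry[2])
        entry[2] = self.clock

    def get(self, key):
        self.clock += 1
        entry = self.entries.get(key)
        if entry is None:
            return None         # miss: caller would ask the NameNode instead
        self._touch(entry)
        return entry[0]

    def put(self, key, value):
        self.clock += 1
        entry = self.entries.get(key)
        if entry is not None:
            entry[0] = value
            self._touch(entry)
            return
        if len(self.entries) >= self.capacity:
            # Evict the entry with the smallest current CRF value
            victim = min(self.entries, key=self._crf_now)
            del self.entries[victim]
        self.entries[key] = [value, 1.0, self.clock]
```

In a client-side cache of the kind the abstract describes, `key` might be a small-file path and `value` its block information, so that a hit avoids a metadata request to the NameNode. The choice of `lam` trades frequency against recency: with `lam=0` a file that was read often long ago outlives a file read once just now, while with `lam=1` the opposite holds.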

Keywords/Search Tags:

Hadoop, HDFS, Apriori Algorithm, Double Merger, Index Mechanism, LRFU


Related items

1	The Improved Apriori Algorithm Based On Hadoop Calculation Model
2	Research And Improvement Of Apriori Algorithm Based On Hadoop
3	Research And Implementation Of Small File Processing Techniques In Hadoop
4	Research And Application Of Improved Apriori Algorithm On Hadoop
5	Research Of Data Mining Method For Public Buildings Energy Consumption Based On Hadoop
6	Research On Improvement Of Apriori Algorithm Based On Hadoop Platform
7	Optimization And Implementation Of Small File Storage In HDFS Under Hadoop Platform
8	Research On Data Security Safeguard Mechanism Based On Hadoop
9	The Research Of LRFU And Its Adaptive Algorithm
10	Application Research Of Apriori Algorithm Based On Index Structure In CRM Of Foreign Trade