
Research on Small File Storage and Access Methods Based on Hadoop

Posted on: 2024-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: C K Zhang
Full Text: PDF
GTID: 2568307184455894
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of informatization, the amount of data generated by users accessing the Internet has grown explosively, and small files account for the vast majority of this data. Although Hadoop is widely used for large-file processing because of its outstanding performance, storing a rapidly growing number of small files directly brings problems such as heavy NameNode memory consumption, uneven disk load, and degraded access performance. How to store and access these small files efficiently has therefore become an urgent research problem, and this thesis aims to design an efficient solution that optimizes the storage and access of small files in HDFS.

For small-file storage, both the correlation between files and the size distribution after merging affect the quality of merged storage, so this thesis proposes a merging and storage strategy based on file correlation and file volume. The strategy uses MapReduce to process file access records in parallel and introduces a grouping and counting strategy that splits frequent itemsets during the self-join phase, improving the Apriori algorithm so that it scans the transaction database fewer times and analyzes file correlation more efficiently. To keep merging effective, small files are first merged according to strong correlation. To make better use of storage space, the thesis also designs a volume-based secondary merging algorithm: when a merged file does not reach the data-block threshold, a second merge based on volume is performed so that merged files are evenly distributed and filled as fully as possible, reducing NameNode memory consumption and improving small-file access efficiency.

For small-file access, where the main problems are frequent interaction between the client and the NameNode and node overload, the thesis proposes a correlation-based prefetching mechanism that uses the correlation between small files to prefetch them. It further introduces an LRU-K cache replacement algorithm that combines the advantages of the LRU and LFU replacement policies, effectively reducing cache pollution and keeping files with high hit rates in the cache, thereby improving small-file access efficiency.

Experimental results show that the proposed approach effectively improves data-block utilization and reduces NameNode memory consumption in the storage phase; in the access phase it reduces the interaction frequency between the client and the NameNode and thus shortens user access time. These results confirm the feasibility and effectiveness of the proposed approach.
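The correlation analysis described above builds on Apriori-style frequent-itemset mining over file access records. The following is a minimal, single-machine sketch of that basic idea, mining frequently co-accessed file pairs from access sessions; it does not reproduce the thesis's MapReduce parallelization or its grouping strategy for the self-join phase, and the class name, the minSupport parameter, and the session representation are illustrative assumptions rather than the thesis's actual implementation.

```java
import java.util.*;

/** Minimal, single-machine sketch: mine frequently co-accessed file pairs
 *  from access sessions (transactions). Illustrates the correlation-analysis
 *  idea only; the thesis's version runs on MapReduce and adds a grouping
 *  strategy to the Apriori self-join, which is not reproduced here. */
public class FilePairMiner {

    /** Returns file pairs whose co-access count is at least minSupport. */
    public static Map<String, Integer> frequentPairs(List<Set<String>> sessions, int minSupport) {
        // Pass 1: count single files and keep the frequent ones (Apriori pruning).
        Map<String, Integer> singleCounts = new HashMap<>();
        for (Set<String> session : sessions)
            for (String f : session)
                singleCounts.merge(f, 1, Integer::sum);
        Set<String> frequentFiles = new HashSet<>();
        singleCounts.forEach((f, c) -> { if (c >= minSupport) frequentFiles.add(f); });

        // Pass 2: count only pairs whose members are both frequent.
        Map<String, Integer> pairCounts = new HashMap<>();
        for (Set<String> session : sessions) {
            List<String> files = new ArrayList<>();
            for (String f : session) if (frequentFiles.contains(f)) files.add(f);
            Collections.sort(files);
            for (int i = 0; i < files.size(); i++)
                for (int j = i + 1; j < files.size(); j++)
                    pairCounts.merge(files.get(i) + "," + files.get(j), 1, Integer::sum);
        }
        pairCounts.values().removeIf(c -> c < minSupport);
        return pairCounts;
    }
}
```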
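The volume-based secondary merge described in the abstract amounts to packing correlated groups whose combined size is still below the data-block threshold into as few block-sized merged files as possible. The sketch below uses a first-fit-decreasing packing rule and an assumed 128 MB block threshold purely for illustration; the thesis's exact packing rule and threshold are not given in the abstract.

```java
import java.util.*;

/** Sketch of the volume-based secondary merging idea: correlated groups whose
 *  combined size is below the HDFS block size are packed together so that each
 *  merged file approaches, but does not exceed, the block threshold. */
public class SecondaryMerger {
    static final long BLOCK_THRESHOLD = 128L * 1024 * 1024; // assumed 128 MB block size

    /** Each inner list is one merged file; input sizes are bytes per correlated group. */
    public static List<List<Long>> pack(List<Long> groupSizes) {
        List<Long> sorted = new ArrayList<>(groupSizes);
        sorted.sort(Collections.reverseOrder());            // largest groups first
        List<List<Long>> bins = new ArrayList<>();
        List<Long> remaining = new ArrayList<>();            // free space left in each bin
        for (long size : sorted) {
            int target = -1;
            for (int i = 0; i < bins.size(); i++)
                if (remaining.get(i) >= size) { target = i; break; }   // first fit
            if (target == -1) {                               // open a new merged file
                bins.add(new ArrayList<>());
                remaining.add(BLOCK_THRESHOLD);
                target = bins.size() - 1;
            }
            bins.get(target).add(size);
            remaining.set(target, remaining.get(target) - size);
        }
        return bins;
    }
}
```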
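The LRU-K policy mentioned in the abstract evicts the cached file whose K-th most recent access is oldest, so a single burst of accesses cannot pollute the cache the way it can under plain LRU. The following is a simplified, non-thread-safe sketch under that general definition (K = 2 by convention); the class and method names are illustrative and do not come from the thesis.

```java
import java.util.*;

/** Simplified LRU-K sketch: each cached file keeps its last K access times, and
 *  on eviction the file whose K-th most recent access is oldest goes first
 *  (files with fewer than K recorded accesses count as infinitely old). */
public class LruKCache<V> {
    private final int capacity, k;
    private final Map<String, V> data = new HashMap<>();
    private final Map<String, Deque<Long>> history = new HashMap<>();
    private long clock = 0;                                  // logical access counter

    public LruKCache(int capacity, int k) { this.capacity = capacity; this.k = k; }

    public V get(String key) {
        if (!data.containsKey(key)) return null;
        touch(key);
        return data.get(key);
    }

    public void put(String key, V value) {
        if (!data.containsKey(key) && data.size() >= capacity) evict();
        data.put(key, value);
        touch(key);
    }

    private void touch(String key) {
        Deque<Long> h = history.computeIfAbsent(key, x -> new ArrayDeque<>());
        h.addLast(++clock);
        if (h.size() > k) h.removeFirst();                   // keep only the last K accesses
    }

    private void evict() {
        String victim = null;
        long oldest = Long.MAX_VALUE;
        for (String key : data.keySet()) {
            Deque<Long> h = history.get(key);
            // K-th most recent access time; files with fewer than K accesses are
            // treated as far older by subtracting a large offset (a simplification).
            long kth = h.size() < k ? h.peekFirst() - Long.MAX_VALUE / 2 : h.peekFirst();
            if (kth < oldest) { oldest = kth; victim = key; }
        }
        data.remove(victim);
        history.remove(victim);
    }
}
```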
Keywords/Search Tags:HDFS, Apriori algorithm, Small file merging, Caching mechanism