
Research on Small File Storage and Access Methods Based on Hadoop

Posted on: 2024-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: C K Zhang
Full Text: PDF
GTID: 2568307184455894
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of informatization, the amount of data generated by users accessing the Internet has grown explosively, and small files account for the vast majority of this data. Although Hadoop is widely used for large-file processing because of its outstanding performance, storing a rapidly growing number of small files directly brings problems such as heavy NameNode memory consumption, uneven disk load, and degraded access performance. How to store and access these small files efficiently has therefore become an urgent research problem, and this thesis aims to design an efficient solution that optimizes the storage and access of small files in HDFS.

For small-file storage, both the correlation between files and the size distribution after merging affect the quality of merged storage, so this thesis proposes a merging and storage strategy based on file correlation and file volume. The strategy uses MapReduce to process file access records in parallel and introduces a grouping and counting strategy that splits frequent itemsets during the self-join phase, improving the Apriori algorithm so that it scans the transaction database fewer times and analyzes file correlation more efficiently. To keep merging effective, small files are first merged according to strong correlation. To make better use of storage space, the thesis also designs a volume-based secondary merging algorithm: when a merged file does not reach the data-block threshold, a second merge based on volume is performed so that merged files are evenly distributed and filled as fully as possible, reducing NameNode memory consumption and improving small-file access efficiency.

For small-file access, where the main problems are frequent interaction between the client and the NameNode and node overload, the thesis proposes a correlation-based prefetching mechanism that uses the correlation between small files to prefetch them. It further introduces an LRU-K cache replacement algorithm that combines the advantages of the LRU and LFU replacement policies, effectively reducing cache pollution and keeping files with high hit rates in the cache, thereby improving small-file access efficiency.

Experimental results show that the proposed approach effectively improves data-block utilization and reduces NameNode memory consumption in the storage phase; in the access phase it reduces the interaction frequency between the client and the NameNode and thus shortens user access time. These results confirm the feasibility and effectiveness of the proposed approach.
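The correlation analysis described above builds on Apriori-style frequent-itemset mining over file access records. The following is a minimal, single-machine sketch of that basic idea, mining frequently co-accessed file pairs from access sessions; it does not reproduce the thesis's MapReduce parallelization or its grouping strategy for the self-join phase, and the class name, the minSupport parameter, and the session representation are illustrative assumptions rather than the thesis's actual implementation.

```java
import java.util.*;

/** Minimal, single-machine sketch: mine frequently co-accessed file pairs
 *  from access sessions (transactions). Illustrates the correlation-analysis
 *  idea only; the thesis's version runs on MapReduce and adds a grouping
 *  strategy to the Apriori self-join, which is not reproduced here. */
public class FilePairMiner {

    /** Returns file pairs whose co-access count is at least minSupport. */
    public static Map<String, Integer> frequentPairs(List<Set<String>> sessions, int minSupport) {
        // Pass 1: count single files and keep the frequent ones (Apriori pruning).
        Map<String, Integer> singleCounts = new HashMap<>();
        for (Set<String> session : sessions)
            for (String f : session)
                singleCounts.merge(f, 1, Integer::sum);
        Set<String> frequentFiles = new HashSet<>();
        singleCounts.forEach((f, c) -> { if (c >= minSupport) frequentFiles.add(f); });

        // Pass 2: count only pairs whose members are both frequent.
        Map<String, Integer> pairCounts = new HashMap<>();
        for (Set<String> session : sessions) {
            List<String> files = new ArrayList<>();
            for (String f : session) if (frequentFiles.contains(f)) files.add(f);
            Collections.sort(files);
            for (int i = 0; i < files.size(); i++)
                for (int j = i + 1; j < files.size(); j++)
                    pairCounts.merge(files.get(i) + "," + files.get(j), 1, Integer::sum);
        }
        pairCounts.values().removeIf(c -> c < minSupport);
        return pairCounts;
    }
}
```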
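The volume-based secondary merge described in the abstract amounts to packing correlated groups whose combined size is still below the data-block threshold into as few block-sized merged files as possible. The sketch below uses a first-fit-decreasing packing rule and an assumed 128 MB block threshold purely for illustration; the thesis's exact packing rule and threshold are not given in the abstract.

```java
import java.util.*;

/** Sketch of the volume-based secondary merging idea: correlated groups whose
 *  combined size is below the HDFS block size are packed together so that each
 *  merged file approaches, but does not exceed, the block threshold. */
public class SecondaryMerger {
    static final long BLOCK_THRESHOLD = 128L * 1024 * 1024; // assumed 128 MB block size

    /** Each inner list is one merged file; input sizes are bytes per correlated group. */
    public static List<List<Long>> pack(List<Long> groupSizes) {
        List<Long> sorted = new ArrayList<>(groupSizes);
        sorted.sort(Collections.reverseOrder());            // largest groups first
        List<List<Long>> bins = new ArrayList<>();
        List<Long> remaining = new ArrayList<>();            // free space left in each bin
        for (long size : sorted) {
            int target = -1;
            for (int i = 0; i < bins.size(); i++)
                if (remaining.get(i) >= size) { target = i; break; }   // first fit
            if (target == -1) {                               // open a new merged file
                bins.add(new ArrayList<>());
                remaining.add(BLOCK_THRESHOLD);
                target = bins.size() - 1;
            }
            bins.get(target).add(size);
            remaining.set(target, remaining.get(target) - size);
        }
        return bins;
    }
}
```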
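The LRU-K policy mentioned in the abstract evicts the cached file whose K-th most recent access is oldest, so a single burst of accesses cannot pollute the cache the way it can under plain LRU. The following is a simplified, non-thread-safe sketch under that general definition (K = 2 by convention); the class and method names are illustrative and do not come from the thesis.

```java
import java.util.*;

/** Simplified LRU-K sketch: each cached file keeps its last K access times, and
 *  on eviction the file whose K-th most recent access is oldest goes first
 *  (files with fewer than K recorded accesses count as infinitely old). */
public class LruKCache<V> {
    private final int capacity, k;
    private final Map<String, V> data = new HashMap<>();
    private final Map<String, Deque<Long>> history = new HashMap<>();
    private long clock = 0;                                  // logical access counter

    public LruKCache(int capacity, int k) { this.capacity = capacity; this.k = k; }

    public V get(String key) {
        if (!data.containsKey(key)) return null;
        touch(key);
        return data.get(key);
    }

    public void put(String key, V value) {
        if (!data.containsKey(key) && data.size() >= capacity) evict();
        data.put(key, value);
        touch(key);
    }

    private void touch(String key) {
        Deque<Long> h = history.computeIfAbsent(key, x -> new ArrayDeque<>());
        h.addLast(++clock);
        if (h.size() > k) h.removeFirst();                   // keep only the last K accesses
    }

    private void evict() {
        String victim = null;
        long oldest = Long.MAX_VALUE;
        for (String key : data.keySet()) {
            Deque<Long> h = history.get(key);
            // K-th most recent access time; files with fewer than K accesses are
            // treated as far older by subtracting a large offset (a simplification).
            long kth = h.size() < k ? h.peekFirst() - Long.MAX_VALUE / 2 : h.peekFirst();
            if (kth < oldest) { oldest = kth; victim = key; }
        }
        data.remove(victim);
        history.remove(victim);
    }
}
```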
Keywords/Search Tags:HDFS, Apriori algorithm, Small file merging, Caching mechanism