
The Optimization Technology And Application Of Massive Small File Access Based On HDFS

Posted on:2022-10-02Degree:MasterType:Thesis
Country:ChinaCandidate:X ZhangFull Text:PDF
GTID:2518306557467974Subject:Computer technology
Abstract/Summary:
Hadoop Distributed File System (HDFS) has been widely used in big data storage due to its high reliability, easy scalability, and high fault tolerance. In recent years, however, the rise of mobile applications such as social networking, short video, and e-commerce has produced huge numbers of small files. HDFS was not originally designed to store large numbers of small files: every small file generates metadata that is kept in the NameNode's memory, so the resulting NameNode memory bottleneck makes HDFS inefficient at storing small files. This thesis studies small file access in HDFS in the cloud environment and covers the following three aspects.

First, to address the poor access performance caused by the NameNode memory bottleneck when HDFS stores small files, a small file merging scheme suited to the big data field is proposed. The scheme first uses a density- and hierarchy-based merging algorithm to cluster small files that are related in terms of user access, and then stores the metadata of the clustered small files in a linked list group. When the amount of small file information stored in one linked list of the group reaches a certain size, the corresponding small files are fetched from HDFS according to the metadata, merged, and used to replace the original small files. Small files within a merged file thus have strong access relevance, which reduces the time spent reading across multiple merged files when users access small files in batches; merging based on linked list groups also avoids storage fragmentation. Experimental results show that the proposed merging scheme greatly reduces NameNode memory consumption and also reduces the read and write overhead introduced by merging files.

Second, to further optimize the read performance of small files, a hybrid cache scheme is proposed in which an intermediate cache is added between the HDFS cluster and users. The scheme divides the cache into a traditional cache area and a prediction cache area, considering the cache replacement strategy and the prefetch strategy at the same time. The traditional cache area handles cache replacement with an adaptive replacement strategy based on multilevel queue optimization; the prediction cache area handles prefetching with a prefetch algorithm based on linear regression. Experimental results show that the hybrid cache scheme improves the cache hit rate for users accessing small files, keeps the hit rate stable as application scenarios and user needs change, and reduces small file read time. The hybrid cache also shares the I/O load of the HDFS cluster and improves cluster stability.

Finally, a prototype system is designed and implemented based on the two small file optimization schemes above. A small file processing module is designed for a cold chain logistics security traceability application. The module sits between the HDFS cluster and the user, handling the user's reads and writes of small files and the user's interaction with the HDFS cluster, while monitoring the cluster's memory resources, traffic, and the module's cache hit rate.
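As a rough illustration of the linear-regression prefetching idea in the prediction cache area, the sketch below fits a least-squares line to a file's recent per-window access counts and flags the file for prefetching when the predicted next-window count exceeds a threshold. The function names (`predict_next`, `should_prefetch`) and the threshold are illustrative assumptions, not the thesis's actual implementation.

```python
def fit_line(ys):
    """Least-squares fit of y = a*x + b over x = 0..n-1 (no external deps)."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:  # a single observation: fall back to a flat line
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x

def predict_next(access_counts):
    """Predict the access count in the next time window by extrapolating
    the fitted line one step past the observed windows."""
    a, b = fit_line(access_counts)
    return a * len(access_counts) + b

def should_prefetch(access_counts, threshold=5.0):
    """Prefetch a small file into the prediction cache area when its
    predicted next-window access count exceeds the threshold (an
    assumed tuning parameter)."""
    return predict_next(access_counts) > threshold
```

For a file whose per-window counts rise steadily, e.g. `[1, 2, 3, 4]`, the fitted line predicts about 5 accesses in the next window, so a rising-traffic file is prefetched while a flat, rarely touched one is not.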
Keywords/Search Tags:Small file, Merge, Cache, Prefetch, HDFS