
Research On Small File Access Technology Based On Hadoop

Posted on: 2021-03-25 | Degree: Master | Type: Thesis
Country: China | Candidate: H M Liu | Full Text: PDF
GTID: 2518306470470194 | Subject: Software engineering
Abstract/Summary:
With the development of information technology and the spread of networks, the data tied to daily life is growing explosively. Although the Hadoop Distributed File System (HDFS) is widely used for distributed storage, it hits performance bottlenecks when handling large numbers of small files: (1) NameNode memory pressure, load imbalance, and low memory-space utilization; (2) a file access mechanism that degrades read performance. To address these problems, this thesis carries out research and technical improvements on two fronts: the storage performance and the read performance of small files.

For storage performance, a file-size threshold is first derived from data analysis and verified. Based on the historical access logs of small files, the Apriori algorithm is used to preprocess the selected small files; from the log analysis, a correlation probability model is built to compute the relevance between files, and a merging algorithm based on a directed graph is designed, which effectively improves NameNode memory utilization.

For read performance, the thesis changes where metadata is stored and builds a multi-level metadata index keyed on merge time. File heat is introduced into the cache, the classical LRU replacement strategy is improved, and a heat-based LRU replacement strategy is proposed. Taking correlation as the driving factor for file prefetching, a correlation-based prefetching mechanism is proposed, which effectively improves file access efficiency.

Finally, building on the above research, a general Hadoop-based storage system for massive files is designed and implemented, and its performance is tested on a pseudo-distributed platform against the original HDFS scheme, the HAR scheme, and the MPM scheme, taking metadata consumption, file read time, and file upload time as the quantitative metrics. The experimental results show that the proposed scheme relieves NameNode pressure and speeds up file reading and uploading when a distributed file system accesses massive small files, providing technical support for the massive-small-file access problem.
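To make the heat-based LRU idea concrete, the following is a minimal sketch of how such a replacement policy might work: each cached entry carries a heat score that decays on every access and is topped up by new hits, and eviction removes the coldest entry, falling back to plain LRU recency to break ties. The class name, the decay formula, and all parameters are illustrative assumptions, not the thesis's actual implementation.

```python
class HeatLRUCache:
    """Toy cache that evicts the entry with the lowest 'heat' score.

    Heat combines access frequency with recency: on each access, past
    heat decays by a factor and a fixed contribution is added. The
    scoring formula here is an illustrative assumption.
    """

    def __init__(self, capacity, decay=0.5):
        self.capacity = capacity
        self.decay = decay      # weight given to past heat on each access
        self.clock = 0          # logical time, advanced on every operation
        self.entries = {}       # key -> (value, heat, last_access_time)

    def _touch(self, key):
        value, heat, _ = self.entries[key]
        # Older heat decays; each fresh access adds a fixed contribution.
        self.entries[key] = (value, heat * self.decay + 1.0, self.clock)

    def get(self, key):
        self.clock += 1
        if key not in self.entries:
            return None
        self._touch(key)
        return self.entries[key][0]

    def put(self, key, value):
        self.clock += 1
        if key in self.entries:
            _, heat, last = self.entries[key]
            self.entries[key] = (value, heat, last)
            self._touch(key)
            return
        if len(self.entries) >= self.capacity:
            # Evict the coldest entry; break heat ties by least-recent
            # access, which degenerates to ordinary LRU.
            victim = min(self.entries,
                         key=lambda k: (self.entries[k][1],
                                        self.entries[k][2]))
            del self.entries[victim]
        self.entries[key] = (value, 1.0, self.clock)
```

In this sketch a frequently read small file accumulates heat and survives eviction even when it was not the most recently touched entry, which is the behavior the abstract attributes to the improved strategy.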
Keywords/Search Tags: Data storage, HDFS, Small file, Merging algorithm, Caching mechanisms