
Research On Storage Strategy Of Massive Small Files Based On HDFS

Posted on: 2018-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: S K Xu
Full Text: PDF
GTID: 2348330563452692
Subject: Computer technology

Abstract/Summary:
In recent years, driven by the rapid development of the mobile Internet, the Internet of Things, and cloud computing, collected data has grown explosively: data comes in many types, in large volumes, and at high velocity, and over time this accumulation produces hundreds of billions or even trillions of small files. How to store these small files efficiently has become a problem widely recognized by both academia and industry. As a widely adopted distributed infrastructure, Hadoop and its distributed storage system HDFS (Hadoop Distributed File System) have become the first choice for massive file storage; HDFS uses a NameNode and DataNodes for file management and storage. However, HDFS was designed to store large streaming files and is not well suited to storing large numbers of small files. This thesis therefore builds on the HDFS platform and, starting from storage-optimization and access strategies for massive small files, focuses on two problems in the current NameNode design: the excessive memory consumed by small-file metadata, and the low efficiency of reading small files from HDFS. The main results of this thesis are as follows:

(1) When massive small files are stored on HDFS, the NameNode keeps a metadata entry in memory for every small file, so the more small files there are, the more metadata accumulates and the greater the NameNode memory consumption. To address this, the thesis designs a small-file upload processing module consisting of four functional units. First, a determination unit filters the files in a directory to find those matching small-file characteristics; then a file processing unit classifies the small files by their related characteristics; next, a file merging unit merges each class of small files into a large file; finally, when a small file must be added to an already merged file, a file appending unit appends it. The appending unit reduces the number of merged files and metadata entries and makes file management more convenient.

(2) When reading large numbers of small files from HDFS, every read of a small file requires an interaction with the NameNode, and the small files are scattered across DataNodes, so large-scale small-file reads are very inefficient. To address this, the thesis designs a small-file reading method. To improve read efficiency, an index table based on the MySQL MEMORY storage engine is designed, while a client-side cache and an independent distributed cache on the DataNodes hold the required data. To raise the cache hit rate, a file prefetching mechanism is employed.

Experiments in the thesis verify the effectiveness of the small-file upload framework and reading method by comparing file upload speed, memory usage, and file read efficiency. The results show that the proposed scheme relieves the memory pressure on the NameNode and effectively improves the speed of small-file uploads and reads.
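As an illustration of the merging unit in (1), the following Java sketch concatenates local small files into a single merged HDFS file while recording each file's offset and length for later indexing. This is a minimal sketch, not the thesis's actual implementation: the 4 MB small-file threshold, the IndexEntry record, and the raw-concatenation container format are all assumptions, since the abstract does not specify them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileMerger {

    /** One index entry per small file inside the merged container (hypothetical layout). */
    public static class IndexEntry {
        public final String name;  // original small-file name
        public final long offset;  // byte offset inside the merged file
        public final int length;   // length of the small file in bytes
        public IndexEntry(String name, long offset, int length) {
            this.name = name;
            this.offset = offset;
            this.length = length;
        }
    }

    // "Determine unit" cutoff: files at or above this size are not treated as small.
    // The thesis gives no concrete threshold; 4 MB is an assumed value.
    private static final long SMALL_FILE_THRESHOLD = 4L * 1024 * 1024;

    public static List<IndexEntry> merge(List<String> localFiles, String mergedHdfsPath)
            throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        List<IndexEntry> index = new ArrayList<>();
        long offset = 0;

        try (FSDataOutputStream out = fs.create(new Path(mergedHdfsPath))) {
            for (String name : localFiles) {
                if (Files.size(Paths.get(name)) >= SMALL_FILE_THRESHOLD) {
                    continue; // not a small file: leave it to normal HDFS storage
                }
                byte[] bytes = Files.readAllBytes(Paths.get(name));
                out.write(bytes); // "merging unit": append content to the container
                index.add(new IndexEntry(name, offset, bytes.length));
                offset += bytes.length;
            }
        }
        return index; // entries to be persisted in the MySQL index table
    }
}
```

Because only the one merged file is registered with the NameNode, the per-file metadata cost collapses to a single entry regardless of how many small files the container holds.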
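The file appending unit in (1) can similarly be sketched with HDFS's append API plus a row insertion into the index table. The JDBC URL, the credentials, and the file_index schema shown in the comment are hypothetical stand-ins for whatever the thesis actually uses, and fs.append assumes the cluster has append support enabled.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileAppender {

    // Hypothetical MEMORY-engine index table, e.g.:
    // CREATE TABLE file_index (name VARCHAR(255) PRIMARY KEY, merged_path VARCHAR(255),
    //                          offset BIGINT, length INT) ENGINE=MEMORY;
    private static final String JDBC_URL = "jdbc:mysql://localhost:3306/hdfs_index";

    public static void appendSmallFile(String localFile, String mergedHdfsPath)
            throws IOException, SQLException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path merged = new Path(mergedHdfsPath);

        byte[] bytes = Files.readAllBytes(Paths.get(localFile));
        long offset = fs.getFileStatus(merged).getLen(); // new record starts at current EOF

        // "File appending unit": extend the existing merged file instead of creating
        // a new one, so the NameNode sees no additional metadata entry.
        try (FSDataOutputStream out = fs.append(merged)) {
            out.write(bytes);
        }

        // Register the new small file in the index table.
        try (Connection conn = DriverManager.getConnection(JDBC_URL, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO file_index (name, merged_path, offset, length) VALUES (?, ?, ?, ?)")) {
            ps.setString(1, localFile);
            ps.setString(2, mergedHdfsPath);
            ps.setLong(3, offset);
            ps.setInt(4, bytes.length);
            ps.executeUpdate();
        }
    }
}
```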
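For the reading method in (2), a minimal sketch of the client-side path might look as follows: look up the small file's location in the MySQL MEMORY index table, serve repeated requests from a bounded client-side cache, and fetch the byte range of the merged file with a single positioned read. The table schema, cache capacity, and connection details are assumptions; the thesis's DataNode-side distributed cache and its prefetching mechanism, which would warm this cache with files likely to be requested next, are not implemented here.

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileReader {

    private static final String JDBC_URL = "jdbc:mysql://localhost:3306/hdfs_index";

    // Client-side cache: a bounded LRU map standing in for the thesis's client cache.
    private final Map<String, byte[]> cache =
            new LinkedHashMap<String, byte[]>(1024, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                    return size() > 1000; // assumed capacity
                }
            };

    public byte[] read(String fileName) throws IOException, SQLException {
        byte[] cached = cache.get(fileName);
        if (cached != null) {
            return cached; // cache hit: no NameNode or DataNode interaction at all
        }
        try (Connection conn = DriverManager.getConnection(JDBC_URL, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT merged_path, offset, length FROM file_index WHERE name = ?")) {
            ps.setString(1, fileName);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new IOException("No index entry for " + fileName);
                }
                byte[] bytes = readRange(rs.getString(1), rs.getLong(2), rs.getInt(3));
                cache.put(fileName, bytes);
                return bytes;
            }
        }
    }

    // Seek directly into the small file's region of the merged HDFS file; one
    // NameNode lookup for the merged file serves many small-file reads.
    private byte[] readRange(String mergedPath, long offset, int length) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[length];
        try (FSDataInputStream in = fs.open(new Path(mergedPath))) {
            in.readFully(offset, buf); // positioned read: seek + read in one call
        }
        return buf;
    }
}
```

The MEMORY engine keeps the index table entirely in RAM, so the metadata lookup avoids disk I/O on the index side as well, which is presumably why the thesis chose it over an on-disk table.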
Keywords/Search Tags:Hadoop, HDFS, massive small file, file merge, cache mechanism