
Research on Optimization Methods for Small File Storage Performance on the Hadoop Distributed File System

Posted on: 2018-05-15    Degree: Master    Type: Thesis
Country: China    Candidate: X D Song    Full Text: PDF
GTID: 2348330512982139    Subject: Electronic and communication engineering
Abstract/Summary:
The era of big data has arrived, and efficient data storage and retrieval has become a pressing issue. Hadoop performs well for large-scale data storage, but the widespread use of social applications such as blogs, Wikipedia, and personal spaces now produces enormous numbers of small files, which pose a serious challenge to storage systems. Because the Hadoop Distributed File System (HDFS) relies on a single Namenode, its efficiency when storing small files is very low, and large numbers of small files easily overload the Namenode. This thesis puts forward a novel solution for small file storage on Hadoop and tests its feasibility. The work is supported by the National Natural Science Foundation of China (Nos. 61271308, 61172072, and 61401015), the discipline construction project for graduate students of the Beijing Municipal Education Commission, and a project of Chengdu Engineering Corporation Limited of Power Construction Corporation of China. The main work is as follows.

First, the thesis analyzes the characteristics and problems of HDFS: the single Namenode keeps a metadata entry for every file, so storing large numbers of small files produces a large amount of metadata and excessive memory consumption. The adopted solution is therefore to merge small files into large files. After merging, however, reading a small file requires consulting a secondary index file, which affects reading efficiency; introducing a secondary index of metadata together with a prefetching and caching mechanism compensates for this and improves reading efficiency. Based on this analysis, the thesis proposes an extended HDFS framework that inserts a data processing layer, responsible for small file merging, prefetching, and caching, between the user layer and the data storage layer to improve storage performance.

The extended framework mainly uses the following algorithms. The first is a small file merging algorithm based on file type, which effectively reduces Namenode memory consumption through a simple classification of files by type. The second is a metadata secondary index merging algorithm, also based on file type, which improves the system's reading efficiency by speeding up the lookup of the mapping entries of the merged large files. The third is a hot-file ("thermal") storage algorithm based on dynamic frequency statistics, in which the most frequently accessed merged files are kept in the prefetching and caching component; when a user requests a prefetched or cached file, the corresponding small files can be read directly without interacting with the Namenode, further improving small file reading efficiency.

Finally, a Hadoop pseudo-distributed platform is built, and the Namenode memory consumption and the file writing and reading efficiency of the original HDFS storage structure, HAR archives, and the improved HDFS storage structure are compared. The experimental results show that although the improved structure reduces file writing efficiency to a certain extent, it effectively reduces Namenode memory consumption and improves small file reading efficiency, and therefore delivers better storage performance than the original small file storage solutions.
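The merging step described above can be illustrated with a minimal Java sketch using the standard Hadoop FileSystem API. This is not the thesis author's implementation; the class names SmallFileMerger and IndexEntry are hypothetical. The sketch appends local small files of one type into a single HDFS file and records each file's offset and length, which is the information the secondary index has to carry.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: merge local small files of one type into a single HDFS file
 *  and build an offset/length index for later lookups. */
public class SmallFileMerger {

    /** One index entry: where a small file lives inside the merged file. */
    public static class IndexEntry {
        public final long offset;
        public final long length;
        public IndexEntry(long offset, long length) { this.offset = offset; this.length = length; }
    }

    /** Appends each small file to mergedPath and records its offset and length. */
    public static Map<String, IndexEntry> merge(List<java.io.File> smallFiles,
                                                Path mergedPath,
                                                Configuration conf) throws IOException {
        Map<String, IndexEntry> index = new HashMap<>();
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(mergedPath, true)) {
            byte[] buf = new byte[64 * 1024];
            for (java.io.File f : smallFiles) {
                long start = out.getPos();              // current offset in the merged file
                try (InputStream in = Files.newInputStream(f.toPath())) {
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        out.write(buf, 0, n);
                    }
                }
                index.put(f.getName(), new IndexEntry(start, out.getPos() - start));
            }
        }
        return index;   // the caller would persist this map as the secondary index file
    }
}

Grouping the input list by file extension before calling merge() gives one merged file per type, so the Namenode tracks one metadata entry per merged file instead of one per small file.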
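Reading a small file back through that index then needs only a single positioned read inside the merged file; only the merged file's metadata has to be resolved. A minimal sketch, again with hypothetical names, under the assumption that the index entry for the requested file has already been looked up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

/** Sketch: resolve a small file from the merged file using its index entry. */
public class SmallFileReader {

    /** Reads length bytes starting at offset from the merged file. */
    public static byte[] read(Path mergedPath, long offset, int length,
                              Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        byte[] data = new byte[length];
        try (FSDataInputStream in = fs.open(mergedPath)) {
            in.readFully(offset, data, 0, length);   // positioned read inside the merged file
        }
        return data;
    }
}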
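The hot-file ("thermal") storage idea, keeping the most frequently requested small files in a cache so that repeat reads bypass HDFS and the Namenode entirely, can be sketched with a simple access-counting cache. The thesis does not give the details of its dynamic frequency statistics, so HotFileCache and its admission/eviction rule below are assumptions for illustration only.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: count accesses per small file and cache the contents of the
 *  most frequently requested ones so repeat reads skip HDFS entirely. */
public class HotFileCache {

    private final int capacity;                                   // max number of cached small files
    private final Map<String, Long> accessCount = new ConcurrentHashMap<>();
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    public HotFileCache(int capacity) { this.capacity = capacity; }

    /** Returns cached bytes if present, otherwise null (caller falls back to HDFS). */
    public byte[] get(String fileName) {
        accessCount.merge(fileName, 1L, Long::sum);
        return cache.get(fileName);
    }

    /** Called after a cache miss: admit the file if it is hotter than the coldest cached entry. */
    public void offer(String fileName, byte[] data) {
        long hits = accessCount.getOrDefault(fileName, 0L);
        if (cache.size() < capacity) {
            cache.put(fileName, data);
            return;
        }
        String coldest = null;
        long coldestHits = Long.MAX_VALUE;
        for (String cached : cache.keySet()) {
            long c = accessCount.getOrDefault(cached, 0L);
            if (c < coldestHits) { coldestHits = c; coldest = cached; }
        }
        if (coldest != null && hits > coldestHits) {
            cache.remove(coldest);
            cache.put(fileName, data);
        }
    }
}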
Keywords/Search Tags:memory consumption, secondary index, small file merge, thermal storage