
Research And Optimization Of Small Files Processing Techniques In Hadoop

Posted on: 2017-12-12  Degree: Master  Type: Thesis
Country: China  Candidate: L J Li  Full Text: PDF
GTID: 2348330503964617  Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of the Internet, traditional storage methods can no longer meet the demands of massive data access; the storage and processing of massive data has become a new research topic. The Hadoop distributed computing platform, with its high reliability, easy scalability, and strong fault tolerance, has been widely adopted in the field of cloud computing. Because Hadoop accesses files through a streaming data access pattern, it is designed to store large files: it performs well on big files, but suffers from low storage efficiency when handling large numbers of small files. Aiming at this problem, this thesis analyzes existing research and improvement schemes, identifies their advantages and disadvantages, and makes corresponding improvements on that basis. The proposed design adds a separate small-file processing module to the distributed file system. This module merges small files before uploading them to HDFS, builds an index over the merged files, and maintains a prefetch cache for subsequent reads and writes. With this architecture, handling small files does not interfere with reading or writing large files, which improves the storage and access efficiency of the system.

The merging and indexing scheme for small files improves on HAR (Hadoop Archives): small files created within the same time period are combined into one large file, which is named after that period. In addition, based on the small files' names and extensions, the thesis builds a Trie-tree index that maps each small file to its block and block-address information, and shards this index by extension to form a two-level index mechanism. The two-level index resides in the small-file processing module to speed up small-file retrieval in HDFS, and the module's buffer pool prefetches the local index and related files. The thesis gives a concrete implementation of this optimization scheme on a Hadoop cluster, including the algorithms for merging small files, a custom input split for MapReduce, and the double index. A set of performance indicators is established to evaluate memory usage efficiency and access efficiency, comparing the proposed scheme against HAR and the original HDFS. The experimental results show that the proposed small-file processing scheme outperforms both the original HDFS and HAR in memory usage efficiency and access efficiency.
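The merge step described above can be sketched in a few lines. This is a minimal, language-agnostic illustration rather than the thesis's implementation: the function names and the `(offset, length)` index layout are assumptions, but they capture the core idea that many small files are packed into one large HDFS-friendly object with a per-file index enabling random access.

```python
# Hypothetical sketch of the merge step: pack small files into one blob
# and record an (offset, length) index entry per small file, so the
# merged file can be stored in HDFS as a single large object.
import io

def merge_small_files(small_files):
    """Pack {name: bytes} into one blob plus an index of (offset, length)."""
    blob = io.BytesIO()
    index = {}
    for name, data in small_files.items():
        offset = blob.tell()
        blob.write(data)
        index[name] = (offset, len(data))
    return blob.getvalue(), index

def read_small_file(blob, index, name):
    """Random access into the merged blob using the index."""
    offset, length = index[name]
    return blob[offset:offset + length]
```

In a real deployment the blob would be written to HDFS and the index handed to the small-file processing module, so that reading one small file never forces a scan of the whole merged file.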
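The two-level index mechanism can likewise be sketched under stated assumptions: the first level shards by file extension, and the second level is a Trie over the file name whose leaves hold the small file's location (here assumed to be a `(merged_file, offset, length)` triple; the thesis stores block and block-address information).

```python
# Assumed sketch of the two-level index: level one shards by extension,
# level two is a character trie over the file name; a leaf stores the
# small file's location inside a merged file.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.location = None  # (merged_file, offset, length) at a leaf

class TwoLevelIndex:
    def __init__(self):
        self.shards = {}  # extension -> trie root (first level)

    def insert(self, filename, location):
        name, _, ext = filename.rpartition(".")
        node = self.shards.setdefault(ext, TrieNode())
        for ch in name:                  # walk/extend the trie (second level)
            node = node.children.setdefault(ch, TrieNode())
        node.location = location

    def lookup(self, filename):
        name, _, ext = filename.rpartition(".")
        node = self.shards.get(ext)
        if node is None:
            return None
        for ch in name:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.location
</imports>
```

Sharding by extension keeps each trie small and lets a lookup touch only the shard for the requested file type, which is what gives the mechanism its retrieval speed-up.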
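The prefetch buffer pool can be illustrated with a simple LRU cache. The eviction policy and the "fetch the whole merged file on a miss" behavior are assumptions for illustration; the abstract says only that the module prefetches the local index and related files.

```python
# Hedged sketch of the prefetch buffer pool (policy is assumed): on a
# cache miss, the entire merged file is fetched and every small file
# inside it is cached, so later reads of co-merged ("related") files
# hit the cache without another HDFS round trip.
from collections import OrderedDict

class PrefetchCache:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.cache = OrderedDict()  # small-file name -> bytes, LRU order

    def get(self, name, fetch_merged):
        """fetch_merged(name) -> {small_file_name: bytes} for the merged
        file containing `name`; invoked only on a cache miss."""
        if name in self.cache:
            self.cache.move_to_end(name)     # refresh LRU position
            return self.cache[name]
        group = fetch_merged(name)           # one read of the merged file
        for member, data in group.items():
            self.cache[member] = data        # prefetch all co-merged files
            self.cache.move_to_end(member)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
        return self.cache[name]
```

Because files merged together tend to be accessed together (they share a creation time period), prefetching the whole group amortizes one HDFS read across many small-file requests.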
Keywords/Search Tags:Hadoop, HDFS, Small Files, Merge, Index