Optimization Study On Storing Massive Small Files Based On Hadoop

Posted on: 2017-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: S Zhang
Full Text: PDF
GTID: 2428330488976113
Subject: Software engineering
Abstract/Summary:
With the rapid development of cloud computing and big data, countless data are generated every day, and a high proportion of them are small files. Many large companies use Hadoop for distributed computing and for storing huge amounts of data. However, when massive numbers of small files exist in a cluster, Hadoop's performance degrades badly. In this paper, we focus on the problem of storing small files on Hadoop and propose a solution that combines HBase, file merging and file indexing to handle small files with different characteristics. We then design and implement a cache system based on a multi-queue replacement algorithm. Experiments verify that the scheme reduces NameNode memory usage and the time consumed in reading and writing large numbers of small files, so the storage performance is optimized.

Firstly, this paper analyzes the research status of massive small file storage in Hadoop. We then introduce the Hadoop framework and conduct in-depth research on the working mechanisms of the distributed file system HDFS, the distributed computing framework MapReduce and the distributed database HBase, with special emphasis on the HDFS architecture and the process of reading and writing files. We summarize the reasons for the high NameNode memory footprint and the low efficiency of reading and writing files when large numbers of small files are stored in Hadoop.

Then, the paper analyzes the problems of storing small files and proposes an overall design scheme for small file storage, in which small files with different characteristics are handled by different processing methods. To address the high memory footprint of the NameNode, this paper adopts a merging scheme based on file type characteristics, so that the number of files is greatly reduced and the file writing speed is improved. To address the poor performance of reading small files, this paper designs an index for the small files based on a word search tree (trie), ensuring that a small file can be retrieved completely and efficiently from the large file produced by merging. To improve reading efficiency further, a cache based on the multi-queue replacement algorithm is designed to serve frequent read requests for hot data on the DataNode; it avoids repeated disk accesses for hot data and the resulting poor read performance of the cluster.

Because the file index must be stored in memory after merging, a large number of ultra-small files would make the index file too large and hurt retrieval performance. In this paper, before a file is written to the cluster, its type and size are identified. Ultra-small files are stored in a specially designed HBase table, which stores them conveniently and efficiently; this not only improves the retrieval performance of ultra-small files but also avoids HBase's inefficiency in processing large files. Ordinary small files follow the processing flow of merging, indexing and caching, so that the scheme of this paper performs well under scenarios with different file-size distributions.

Finally, we set up a Hadoop cluster and compare the design of this paper with the original HDFS and other schemes, analyzing the NameNode memory footprint and the small file read/write performance under each scheme. The experimental results show that the proposed scheme can greatly reduce the NameNode memory footprint and improve the performance of reading and writing massive small files in the cluster.
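To make the merging step concrete, here is a minimal Java sketch (using the standard Hadoop FileSystem API) that concatenates a batch of small files into one large HDFS file while recording each file's offset and length. The class name MergedFileWriter and the in-memory index layout are illustrative assumptions, not details taken from the thesis.

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: merge small HDFS files into one large file,
 *  keeping an in-memory map of (file name -> {offset, length}). */
public class MergedFileWriter {
    public static Map<String, long[]> merge(FileSystem fs, List<Path> smallFiles,
                                            Path mergedFile) throws IOException {
        Map<String, long[]> index = new HashMap<>();
        try (FSDataOutputStream out = fs.create(mergedFile)) {
            for (Path p : smallFiles) {
                long offset = out.getPos();                  // start of this file
                try (FSDataInputStream in = fs.open(p)) {
                    IOUtils.copyBytes(in, out, 4096, false); // append contents
                }
                long length = out.getPos() - offset;
                index.put(p.getName(), new long[]{offset, length});
            }
        }
        return index; // to be persisted as the index of the merged file
    }
}
```

The returned map is exactly the information the file index of the next sketch would be built from.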
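The "word search tree" index described in the abstract is essentially a trie keyed on file names. A minimal, self-contained sketch of such an index, mapping a file name to its (offset, length) inside the merged file, could look as follows; the node layout is my assumption for illustration, not the thesis design.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal trie ("word search tree") mapping file names to their
 *  (offset, length) inside a merged file. Illustrative sketch only. */
public class FileNameTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        long[] location; // {offset, length}; null for non-terminal nodes
    }

    private final Node root = new Node();

    public void put(String fileName, long offset, long length) {
        Node cur = root;
        for (char c : fileName.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.location = new long[]{offset, length};
    }

    /** Returns {offset, length}, or null if the file is not indexed. */
    public long[] get(String fileName) {
        Node cur = root;
        for (char c : fileName.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return null;
        }
        return cur.location;
    }
}
```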
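For the DataNode-side cache, a common formulation of the multi-queue (MQ) replacement algorithm keeps several LRU queues ranked by access frequency: an entry is promoted as its hit count crosses successive powers of two, and eviction removes the least recently used entry of the lowest non-empty queue. The sketch below follows that textbook formulation and is not necessarily the exact variant implemented in the thesis.

```java
import java.util.LinkedHashMap;

/** Simplified multi-queue (MQ) cache: LRU queues ranked by access
 *  frequency; entries are promoted as hits accumulate, and eviction
 *  removes the LRU entry of the lowest non-empty queue. */
public class MultiQueueCache<K, V> {
    private static class Entry<V> {
        V value; int hits;
        Entry(V v) { value = v; hits = 1; }
    }

    private final int capacity;
    private final LinkedHashMap<K, Entry<V>>[] queues; // index = frequency level
    private int size = 0;

    @SuppressWarnings("unchecked")
    public MultiQueueCache(int capacity, int levels) {
        this.capacity = capacity;
        this.queues = new LinkedHashMap[levels];
        for (int i = 0; i < levels; i++) {
            queues[i] = new LinkedHashMap<>(16, 0.75f, true); // access-order LRU
        }
    }

    // Queue i holds entries with 2^i <= hits < 2^(i+1).
    private int level(int hits) {
        int lvl = 31 - Integer.numberOfLeadingZeros(hits);
        return Math.min(lvl, queues.length - 1);
    }

    public synchronized V get(K key) {
        for (int i = 0; i < queues.length; i++) {
            Entry<V> e = queues[i].remove(key);
            if (e != null) {
                e.hits++;
                queues[level(e.hits)].put(key, e); // may promote to a higher queue
                return e.value;
            }
        }
        return null;
    }

    /** Inserts a new entry; assumes the key is not already cached. */
    public synchronized void put(K key, V value) {
        if (size >= capacity) evict();
        queues[0].put(key, new Entry<>(value));
        size++;
    }

    // Drop the least recently used entry of the lowest non-empty queue.
    private void evict() {
        for (LinkedHashMap<K, Entry<V>> q : queues) {
            if (!q.isEmpty()) {
                q.remove(q.keySet().iterator().next());
                size--;
                return;
            }
        }
    }
}
```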
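Routing ultra-small files into HBase can be sketched with the standard HBase client API. The table name small_files, the column family f, and the 64 KB cutoff below are illustrative assumptions; the thesis defines its own table design and threshold.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

/** Hypothetical router: ultra-small files go into an HBase table,
 *  larger small files fall through to the merge/index/cache path. */
public class SmallFileRouter {
    private static final long ULTRA_SMALL_LIMIT = 64 * 1024; // assumed 64 KB cutoff

    public static void store(String fileName, byte[] content) throws IOException {
        if (content.length <= ULTRA_SMALL_LIMIT) {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("small_files"))) {
                Put put = new Put(Bytes.toBytes(fileName)); // row key = file name
                put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), content);
                table.put(put);
            }
        } else {
            // Fall through to the merge + trie-index + MQ-cache pipeline
            // (see the MergedFileWriter and FileNameTrie sketches above).
        }
    }
}
```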
Keywords/Search Tags: Hadoop, Small file, HDFS, HBase, File merging, Index, Cache, Cluster