
The Research Of HDFS Optimization Towards Lots Of Small Files Accessing And Storage

Posted on: 2016-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: T Li
Full Text: PDF
GTID: 2298330452966405
Subject: Computer Science and Technology
Abstract/Summary:
Hadoop, an open-source software framework developed for reliable, scalable, distributed computing and storage, is successfully used by many companies, including Yahoo, Amazon, Facebook, and the New York Times. The Hadoop Distributed File System (HDFS), the primary storage system of Hadoop, is a portable, highly reliable, high-throughput, open-source distributed file system. It is primarily designed for streaming access to large files. A large number of small files occupies too much memory. Reading through small files normally causes many seeks and much hopping from one DataNode to another to retrieve each small file, which is an inefficient data access pattern. HDFS is therefore not well suited to managing large numbers of small files.

A middleware called HMFS is proposed in this paper to improve the efficiency of storing and accessing small files on HDFS. It consists of a user interface, tasks, and buffers. The file operation interfaces make it easier for software developers to submit different file requests, while all tasks run in the background. HMFS boosts file upload speed by using an asynchronous write mechanism, and file download speed by adopting a prefetching and caching strategy.

To improve the efficiency of file merging and caching on HDFS, a new approach, Smart File System (SmartFS), is proposed in this paper. By analyzing the file access log to learn the access behavior of users, SmartFS establishes a probability model of file associations. This model serves as the reference for the merging algorithm, which merges related small files into large files that are then stored on HDFS. When a file is accessed, SmartFS prefetches the related files according to the prefetching algorithm to speed up access. To guarantee enough cache space, a cache replacement algorithm, Prefetching-LFU, is put forward.

Finally, this paper designs a small-file system based on HDFS that combines the advantages of HMFS and SmartFS. The system applies HMFS to handle online requests, such as file upload, download, update, and delete requests. It applies SmartFS to analyze the file access log, obtain the file associations, combine the related files into one large file, and upload it to HDFS. The system adopts the caching strategies of both HMFS and SmartFS to ensure efficient operation in a variety of situations.

The experimental results show that the system achieves high storage and access speed for massive numbers of small files on HDFS. It can merge related files and improve the efficiency of storing and accessing small files on HDFS. The system is thus a general-purpose, high-performance small-file system based on HDFS.
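The abstract does not give the details of SmartFS's probability model or prefetching algorithm, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the access log is an ordered list of file names, that "association" means how often one file is accessed immediately after another, and that a file is prefetched when that estimated probability reaches a threshold. The names `association_probabilities`, `prefetch_candidates`, `window`, and `threshold` are hypothetical and do not come from the thesis.

```python
from collections import Counter, defaultdict


def association_probabilities(access_log, window=1):
    """Estimate P(next file = b | current file = a) from an ordered access log.

    Assumption: two files are "associated" when one is accessed within
    `window` positions after the other in the log.
    """
    follow_counts = defaultdict(Counter)
    for i, current in enumerate(access_log):
        for nxt in access_log[i + 1 : i + 1 + window]:
            if nxt != current:
                follow_counts[current][nxt] += 1

    probs = {}
    for current, counter in follow_counts.items():
        total = sum(counter.values())
        probs[current] = {name: count / total for name, count in counter.items()}
    return probs


def prefetch_candidates(probs, accessed, threshold=0.5):
    """Return files whose estimated association with `accessed` meets the threshold."""
    related = probs.get(accessed, {})
    return sorted(name for name, p in related.items() if p >= threshold)


# Hypothetical access log, for illustration only.
log = ["a.txt", "b.txt", "a.txt", "b.txt", "c.txt", "a.txt", "b.txt"]
probs = association_probabilities(log)
print(prefetch_candidates(probs, "a.txt"))  # ['b.txt']
```

In this sketch the same probability table could also guide merging, for example by placing files with a high mutual association in the same merged file, but how the thesis actually groups files is not stated in the abstract.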
Keywords/Search Tags: HDFS, distributed file system, small files, file merging, prefetch and cache