
Research Of Small Files Storage Method Based On HDFS

Posted on: 2014-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: Q W Dong
Full Text: PDF
GTID: 2248330398452534
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of science and technology, digital information is growing explosively, and traditional storage methods can no longer meet the demand for storing massive data. Storing and processing massive data efficiently has therefore become an urgent problem. At present, many large enterprises use the Hadoop Distributed File System (HDFS) to store massive data. HDFS was originally designed to store large files with good reliability and scalability. With the development of the Internet, however, people have begun to use HDFS to store small files, and this has exposed its shortcomings. Small-file storage has become a bottleneck that restricts the overall performance of HDFS.

This thesis studies the problem of storing small files in HDFS. It proposes three algorithms that address the preprocessing of small files before they are stored in HDFS and their retrieval after storage. First, we introduce the Small Files Merging Algorithm based on Feature Type and Sequence Table. This algorithm obtains the features of the small files and the data types those features belong to, merges the small files in a flow-through (streaming) manner, and creates an index file keyed by file name that is managed centrally by the NameNode. Second, we present the DataNode Pre-Allocation Algorithm based on Data Feature. Its purpose is to improve the efficiency of the NameNode and to reduce the impact that NameNode overload has on the overall performance of HDFS. Third, we propose the Small Files Retrieval Algorithm based on Frequency of Access, which quickly locates the required small files among a large number of index files. It borrows the ideas of virtual storage and page replacement: when users search, index files are loaded into virtual memory and replaced according to their access frequency. In this way, the required index file can be hit quickly.

We test the three algorithms with three use cases, designed by varying the percentage of small files and the thresholds used in the algorithms. Experimental results show that the three algorithms effectively improve the efficiency of storing and reading small files in HDFS and optimize the storage performance of HDFS as a whole.
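
The abstract does not give the algorithms in detail, so the following is only a minimal in-memory sketch of two of the ideas it names: merging small files into larger blocks with an index keyed by file name, and keeping index files in a bounded cache that evicts by access frequency. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the names merge_small_files, read_small_file and FrequencyIndexCache, the BLOCK_LIMIT threshold, and the (block id, offset, length) index layout. The sketch does not call HDFS and leaves out the feature-type grouping and the DataNode pre-allocation step.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical threshold: a merged block is closed once adding another
# file would push it past this size (tiny here, for illustration only).
BLOCK_LIMIT = 64

IndexEntry = Tuple[int, int, int]  # (block id, offset, length)


def merge_small_files(files: Dict[str, bytes], limit: int = BLOCK_LIMIT):
    """Append small files to merged blocks in arrival order and build
    an index that maps each file name to its location."""
    blocks: List[bytearray] = [bytearray()]
    index: Dict[str, IndexEntry] = {}
    for name, data in files.items():
        # Start a new merged block when the current one would overflow.
        if blocks[-1] and len(blocks[-1]) + len(data) > limit:
            blocks.append(bytearray())
        index[name] = (len(blocks) - 1, len(blocks[-1]), len(data))
        blocks[-1].extend(data)
    return blocks, index


def read_small_file(blocks: List[bytearray], index: Dict[str, IndexEntry],
                    name: str) -> bytes:
    """Look a file up by name and slice it out of its merged block."""
    block_id, offset, length = index[name]
    return bytes(blocks[block_id][offset:offset + length])


class FrequencyIndexCache:
    """Holds at most `capacity` index files in memory and evicts the
    least frequently accessed one, in the spirit of page replacement."""

    def __init__(self, capacity: int, loader: Callable[[str], dict]):
        self.capacity = capacity
        self.loader = loader            # loads an index file by id
        self.cached: Dict[str, dict] = {}
        self.hits: Counter = Counter()  # access count per index id

    def get(self, index_id: str) -> dict:
        self.hits[index_id] += 1
        if index_id not in self.cached:
            if len(self.cached) >= self.capacity:
                # Evict the cached index file with the fewest accesses.
                victim = min(self.cached, key=lambda k: self.hits[k])
                del self.cached[victim]
            self.cached[index_id] = self.loader(index_id)
        return self.cached[index_id]


if __name__ == "__main__":
    files = {"a.txt": b"hello", "b.txt": b"world!", "c.log": b"x" * 60}
    blocks, index = merge_small_files(files)
    print(index["c.log"], read_small_file(blocks, index, "b.txt"))
```

In this sketch the access counts are kept after an index file is evicted, so an index file that was popular earlier regains priority quickly when it is reloaded. Whether the thesis makes the same choice is not stated in the abstract.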
Keywords/Search Tags:HDFS, Data Feature, Small files, Storage