
The Design And Implementation Of Massive Small Files Storage System Based On HDFS

Posted on: 2013-03-22  Degree: Master  Type: Thesis
Country: China  Candidate: Y Xu  Full Text: PDF
GTID: 2298330422974106  Subject: Software engineering
Abstract/Summary:
Recently, enterprise and personal data have grown explosively. According to Google's CEO Eric Schmidt, every two days we now create as much information as we did from the dawn of civilization up until 2003. How to store massive amounts of data is a major problem for current storage systems. Traditional centralized storage cannot meet the requirements of massive storage, so many distributed file systems have emerged for large-scale data storage, such as the Google File System (GFS), the Hadoop Distributed File System (HDFS), PVFS, and Lustre.

These distributed file systems have good scalability and fault tolerance, and they meet the needs of storing massive large files. However, many applications need to store not only massive large files but also massive numbers of small files. Distributed file systems such as GFS and HDFS can store large files efficiently, but they become very inefficient when storing large numbers of small files. To address this problem, industry and academia have proposed many methods, but these methods suffer from drawbacks such as low performance, low system reliability, and inefficient metadata storage. In response to these challenges, we designed and implemented a massive-small-file storage system based on HDFS.

The main idea of the system is that small files in the same HDFS directory are merged into one large file, called a small-file data file. At the same time, a small-file index is generated that records the position of each small file within the corresponding data file. The resulting system is a scalable, fault-tolerant, distributed mass-storage cluster for small files: in this way, small files are stored in a distributed and fault-tolerant manner.
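The merge-and-index idea described above can be sketched as follows. This is a minimal illustration using in-memory data in place of HDFS files; all names (`merge_small_files`, `read_small_file`) are illustrative and not from the thesis.

```python
import io


def merge_small_files(small_files):
    """Concatenate small files into one data blob and build an index.

    small_files: dict mapping file name -> bytes content.
    Returns (data, index), where index maps each name to its
    (offset, length) within the merged data file.
    """
    buf = io.BytesIO()
    index = {}
    for name, content in small_files.items():
        # Record where this small file starts and how long it is.
        index[name] = (buf.tell(), len(content))
        buf.write(content)
    return buf.getvalue(), index


def read_small_file(data, index, name):
    """Recover one small file from the merged data via the index."""
    offset, length = index[name]
    return data[offset:offset + length]
```

In the real system the merged data file lives in HDFS and is replicated by HDFS itself, so the small files inherit HDFS's distribution and fault tolerance, while the index keeps them individually addressable.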
The system also distributes the small-file index to each datanode to relieve the single-namenode bottleneck, and it reduces the risk of losing small files through an index fault-tolerance mechanism. At the same time, the system creates multiple data files in a single directory to resolve write conflicts when clients access the same directory concurrently. On top of this, the client caches commonly used index information to improve file-access efficiency.

Experiments show that the system significantly outperforms HDFS in write latency and throughput. Furthermore, the system addresses the problem that the metadata of massive small files in HDFS grows too large, and the index fault-tolerance mechanism improves reliability for small files.
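The client-side caching of index information might look like the following sketch, where lookups that would normally go to a datanode are memoized locally. `fetch_fn` stands in for the remote index lookup (an RPC in the real system); the class name, capacity, and eviction policy are illustrative assumptions, not details from the thesis.

```python
class CachedIndexClient:
    """Cache frequently used small-file index entries on the client.

    fetch_fn simulates a remote index lookup against a datanode;
    in the real system this would be a network call.
    """

    def __init__(self, fetch_fn, capacity=1024):
        self.fetch_fn = fetch_fn
        self.capacity = capacity
        self.cache = {}  # file name -> (offset, length)
        self.misses = 0  # counts how often the remote lookup was needed

    def lookup(self, name):
        """Return the (offset, length) index entry for a small file."""
        if name in self.cache:
            return self.cache[name]
        self.misses += 1
        entry = self.fetch_fn(name)
        if len(self.cache) >= self.capacity:
            # Simple eviction: drop an arbitrary entry; a real system
            # would likely use an LRU or frequency-based policy.
            self.cache.pop(next(iter(self.cache)))
        self.cache[name] = entry
        return entry
```

Repeated accesses to the same small file then avoid the remote index entirely, which is where the claimed access-efficiency gain would come from.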
Keywords/Search Tags: Massive Small Files, Distributed File System, Distributed Index, Fault-Tolerant, Hadoop File System