
The Design And Implementation Of Massive Small Files Storage System Based On HDFS

Posted on: 2013-03-22  Degree: Master  Type: Thesis
Country: China  Candidate: Y Xu  Full Text: PDF
GTID: 2298330422974106  Subject: Software engineering
Abstract/Summary:
Recently, enterprise and personal data have grown explosively. According to Google's CEO Eric Schmidt, every two days we now create as much information as we did from the dawn of civilization up until 2003. How to store massive amounts of data is a major problem for current storage systems. Traditional centralized storage cannot meet the requirements of massive storage, so many distributed file systems have emerged for large-scale data storage, such as the Google File System (GFS), the Hadoop Distributed File System (HDFS), PVFS, and Lustre.

These distributed file systems have good scalability and fault tolerance, and they meet the needs of storing massive large files. However, many applications need to store not only massive large files but also massive numbers of small files. Distributed file systems such as GFS and HDFS can store large files efficiently, but they become very inefficient when storing large numbers of small files. To address this problem, industry and academia have proposed many methods, but these methods suffer from drawbacks such as low performance, low system reliability, and inefficient metadata storage. In response to these challenges, we designed and implemented a massive-small-file storage system based on HDFS.

The main idea of the system is that small files in the same HDFS directory are merged into one large file, called a small-file data file. At the same time, a small-file index is generated that records the position of each small file within the corresponding data file. The resulting system is a scalable, fault-tolerant, distributed mass-storage cluster for small files: in this way, small files are stored in a distributed and fault-tolerant manner.
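The merge-and-index idea described above can be sketched as follows. This is a minimal illustration using in-memory data in place of HDFS files; all names (`merge_small_files`, `read_small_file`) are illustrative and not from the thesis.

```python
import io


def merge_small_files(small_files):
    """Concatenate small files into one data blob and build an index.

    small_files: dict mapping file name -> bytes content.
    Returns (data, index), where index maps each name to its
    (offset, length) within the merged data file.
    """
    buf = io.BytesIO()
    index = {}
    for name, content in small_files.items():
        # Record where this small file starts and how long it is.
        index[name] = (buf.tell(), len(content))
        buf.write(content)
    return buf.getvalue(), index


def read_small_file(data, index, name):
    """Recover one small file from the merged data via the index."""
    offset, length = index[name]
    return data[offset:offset + length]
```

In the real system the merged data file lives in HDFS and is replicated by HDFS itself, so the small files inherit HDFS's distribution and fault tolerance, while the index keeps them individually addressable.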
The system also distributes the small-file index to each datanode to relieve the single-namenode bottleneck, and it reduces the risk of losing small files through an index fault-tolerance mechanism. At the same time, the system creates multiple data files in a single directory to resolve write conflicts when clients access the same directory concurrently. On top of this, the client caches commonly used index information to improve file-access efficiency.

Experiments show that the system significantly outperforms HDFS in write latency and throughput. Furthermore, the system addresses the problem that the metadata of massive small files in HDFS grows too large, and the index fault-tolerance mechanism improves reliability for small files.
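The client-side caching of index information might look like the following sketch, where lookups that would normally go to a datanode are memoized locally. `fetch_fn` stands in for the remote index lookup (an RPC in the real system); the class name, capacity, and eviction policy are illustrative assumptions, not details from the thesis.

```python
class CachedIndexClient:
    """Cache frequently used small-file index entries on the client.

    fetch_fn simulates a remote index lookup against a datanode;
    in the real system this would be a network call.
    """

    def __init__(self, fetch_fn, capacity=1024):
        self.fetch_fn = fetch_fn
        self.capacity = capacity
        self.cache = {}  # file name -> (offset, length)
        self.misses = 0  # counts how often the remote lookup was needed

    def lookup(self, name):
        """Return the (offset, length) index entry for a small file."""
        if name in self.cache:
            return self.cache[name]
        self.misses += 1
        entry = self.fetch_fn(name)
        if len(self.cache) >= self.capacity:
            # Simple eviction: drop an arbitrary entry; a real system
            # would likely use an LRU or frequency-based policy.
            self.cache.pop(next(iter(self.cache)))
        self.cache[name] = entry
        return entry
```

Repeated accesses to the same small file then avoid the remote index entirely, which is where the claimed access-efficiency gain would come from.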
Keywords/Search Tags: Massive Small Files, Distributed File System, Distributed Index, Fault-Tolerant, Hadoop File System