
A Strategy To Deal With Massive Small Files In Hadoop Distributed File Systems

Posted on: 2017-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: C X Sun
Full Text: PDF
GTID: 2308330503485306
Subject: Electronic and communication engineering
Abstract/Summary:
The Hadoop platform has been widely used in distributed computing because of its reliability and scalability. One of its most important components is the Hadoop Distributed File System (HDFS), which stores terabyte (TB) or petabyte (PB) scale data very efficiently. However, when HDFS stores a large number of small files, whose sizes are significantly smaller than the HDFS block size, the Small Files Problem arises: the NameNode's main memory is overwhelmed by metadata and data transmission efficiency drops.

Extended HDFS (EHDFS) has been proposed to solve the Small Files Problem. It works well when only a few files are stored in the system, but as the number of files grows, EHDFS no longer helps.

In our research, the basic approach is to analyze client behavior to mine association rules between files and to merge the correlated files. This reduces the metadata footprint in the NameNode's main memory. Furthermore, a prefetching mechanism decreases the preparation time spent requesting metadata, which improves data transmission efficiency.

We first add a Behavior Analysis Unit (BAU) node to HDFS, so that our system consists of the NameNode, the DataNodes, and the BAU. Each DataNode reports to the BAU the list of files that each client has requested. The BAU assembles the complete access record from all DataNodes in the system, mines association rules between files from this record, and merges the related files.

We then add a cache on the client side. When a client requests a file that is correlated with other files, the metadata of the requested file as well as of its correlated files is obtained from the NameNode and saved in the client cache. When the client later requests another file whose metadata is already in the cache, it can go directly to the corresponding DataNodes instead of asking the NameNode for the metadata.

Finally, we focus on stabilizing the performance of HBAU (the HDFS extended with the BAU) by guaranteeing safe data storage and improving its efficiency in solving the Small Files Problem, so that HBAU becomes a practical rather than a purely theoretical distributed file system. A special write procedure, applied only on the BAU node when writing the merged files, shortens the stopping period. Moreover, a recommendation mechanism is introduced to smooth out variations in client behavior. Last but not least, we expose several parameters so that HBAU clients can choose the corresponding algorithm, or even algorithms designed by themselves.
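This abstract does not spell out the exact mining algorithm the BAU uses, so the following is only a minimal sketch of the underlying idea: count how often pairs of files appear together in per-client request lists (as reported by the DataNodes) and keep the pairs that reach a support threshold, in the spirit of frequent-itemset mining. All class and method names here are illustrative, not taken from the thesis.

```java
import java.util.*;

/**
 * Illustrative sketch only: counts how often pairs of files are requested
 * together by the same client, a simplified stand-in for the association
 * rule mining the BAU performs over the DataNode access reports.
 * Class and method names are hypothetical, not from the thesis.
 */
public class CoAccessMiner {

    /** Returns file pairs whose co-access count reaches minSupport. */
    public static Map<String, Integer> minePairs(
            Map<String, List<String>> requestsByClient, int minSupport) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> files : requestsByClient.values()) {
            // Deduplicate and sort so each pair is counted once per client, in a stable order.
            List<String> unique = new ArrayList<>(new TreeSet<>(files));
            for (int i = 0; i < unique.size(); i++) {
                for (int j = i + 1; j < unique.size(); j++) {
                    String key = unique.get(i) + "|" + unique.get(j);
                    pairCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        // Keep only pairs that meet the support threshold.
        pairCounts.values().removeIf(count -> count < minSupport);
        return pairCounts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> requests = new HashMap<>();
        requests.put("client-1", Arrays.asList("/logs/a.txt", "/logs/b.txt"));
        requests.put("client-2", Arrays.asList("/logs/a.txt", "/logs/b.txt", "/logs/c.txt"));
        System.out.println(minePairs(requests, 2)); // {/logs/a.txt|/logs/b.txt=2}
    }
}
```

File pairs surviving the threshold would then be candidates for merging into one combined file, which is what reduces the NameNode's metadata footprint.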
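Similarly, a minimal sketch of the client-side metadata cache with prefetching of correlated files' metadata, assuming a hypothetical NameNodeClient wrapper and FileMeta record; these are placeholders for illustration, not real HDFS client APIs.

```java
import java.util.*;

/**
 * Illustrative sketch of the client-side metadata cache described above.
 * On a cache miss the client asks the NameNode for the metadata of the
 * requested file plus its correlated files; later requests for those files
 * skip the NameNode entirely. NameNodeClient and FileMeta are hypothetical
 * placeholders, not part of the HDFS client API.
 */
public class ClientMetadataCache {

    /** Placeholder for the block locations and other metadata of one file. */
    public record FileMeta(String path, List<String> dataNodeAddresses) {}

    /** Placeholder for the call that returns metadata for a file and its correlated files. */
    public interface NameNodeClient {
        Map<String, FileMeta> getMetaWithCorrelated(String path);
    }

    private final Map<String, FileMeta> cache = new HashMap<>();
    private final NameNodeClient nameNode;

    public ClientMetadataCache(NameNodeClient nameNode) {
        this.nameNode = nameNode;
    }

    /** Returns metadata from the local cache when possible, otherwise fetches and prefetches. */
    public FileMeta getMeta(String path) {
        FileMeta cached = cache.get(path);
        if (cached != null) {
            return cached; // go straight to the DataNodes, no NameNode round trip
        }
        // Miss: fetch metadata for the file and its correlated files in one request.
        cache.putAll(nameNode.getMetaWithCorrelated(path));
        return cache.get(path);
    }
}
```

The point of prefetching the correlated files' metadata in the same round trip is that subsequent requests for those files hit the local cache, which is where the claimed reduction in metadata-request preparation time comes from.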
Keywords/Search Tags: HDFS, Small Files Problem, Client Behavior Analysis, Association Rules Mining