
A Strategy To Deal With Massive Small Files In Hadoop Distributed File Systems

Posted on: 2017-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: C X Sun
Full Text: PDF
GTID: 2308330503485306
Subject: Electronic and communication engineering
Abstract/Summary:
The Hadoop platform has been widely used in distributed computing because of its reliability and scalability. One of its most important components is the Hadoop Distributed File System (HDFS), which stores terabyte (TB) or petabyte (PB) scale data very efficiently. However, when HDFS stores a large number of small files, whose sizes are significantly smaller than the HDFS block size, the Small Files Problem arises: the NameNode's main memory is overwhelmed by metadata and data transmission efficiency drops.

Extended HDFS (EHDFS) has been proposed to solve the Small Files Problem. It works well when only a few files are stored in the system, but as the number of files grows, EHDFS no longer helps.

In our research, the basic approach is to analyze client behavior to mine association rules between files and to merge the correlated files. This reduces the metadata footprint in the NameNode's main memory. Furthermore, a prefetching mechanism decreases the preparation time spent requesting metadata, which improves data transmission efficiency.

We first add a Behavior Analysis Unit (BAU) node to HDFS, so that our system consists of the NameNode, the DataNodes, and the BAU. Each DataNode reports to the BAU the list of files that each client has requested. The BAU assembles the complete access record from all DataNodes in the system, mines association rules between files from this record, and merges the related files.

We then add a cache on the client side. When a client requests a file that is correlated with other files, the metadata of the requested file as well as of its correlated files is obtained from the NameNode and saved in the client cache. When the client later requests another file whose metadata is already in the cache, it can go directly to the corresponding DataNodes instead of asking the NameNode for the metadata.

Finally, we focus on stabilizing the performance of HBAU (the HDFS extended with the BAU) by guaranteeing safe data storage and improving its efficiency in solving the Small Files Problem, so that HBAU becomes a practical rather than a purely theoretical distributed file system. A special write procedure, applied only on the BAU node when writing the merged files, shortens the stopping period. Moreover, a recommendation mechanism is introduced to smooth out variations in client behavior. Last but not least, we expose several parameters so that HBAU clients can choose the corresponding algorithm, or even algorithms designed by themselves.
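This abstract does not spell out the exact mining algorithm the BAU uses, so the following is only a minimal sketch of the underlying idea: count how often pairs of files appear together in per-client request lists (as reported by the DataNodes) and keep the pairs that reach a support threshold, in the spirit of frequent-itemset mining. All class and method names here are illustrative, not taken from the thesis.

```java
import java.util.*;

/**
 * Illustrative sketch only: counts how often pairs of files are requested
 * together by the same client, a simplified stand-in for the association
 * rule mining the BAU performs over the DataNode access reports.
 * Class and method names are hypothetical, not from the thesis.
 */
public class CoAccessMiner {

    /** Returns file pairs whose co-access count reaches minSupport. */
    public static Map<String, Integer> minePairs(
            Map<String, List<String>> requestsByClient, int minSupport) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> files : requestsByClient.values()) {
            // Deduplicate and sort so each pair is counted once per client, in a stable order.
            List<String> unique = new ArrayList<>(new TreeSet<>(files));
            for (int i = 0; i < unique.size(); i++) {
                for (int j = i + 1; j < unique.size(); j++) {
                    String key = unique.get(i) + "|" + unique.get(j);
                    pairCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        // Keep only pairs that meet the support threshold.
        pairCounts.values().removeIf(count -> count < minSupport);
        return pairCounts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> requests = new HashMap<>();
        requests.put("client-1", Arrays.asList("/logs/a.txt", "/logs/b.txt"));
        requests.put("client-2", Arrays.asList("/logs/a.txt", "/logs/b.txt", "/logs/c.txt"));
        System.out.println(minePairs(requests, 2)); // {/logs/a.txt|/logs/b.txt=2}
    }
}
```

File pairs surviving the threshold would then be candidates for merging into one combined file, which is what reduces the NameNode's metadata footprint.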
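Similarly, a minimal sketch of the client-side metadata cache with prefetching of correlated files' metadata, assuming a hypothetical NameNodeClient wrapper and FileMeta record; these are placeholders for illustration, not real HDFS client APIs.

```java
import java.util.*;

/**
 * Illustrative sketch of the client-side metadata cache described above.
 * On a cache miss the client asks the NameNode for the metadata of the
 * requested file plus its correlated files; later requests for those files
 * skip the NameNode entirely. NameNodeClient and FileMeta are hypothetical
 * placeholders, not part of the HDFS client API.
 */
public class ClientMetadataCache {

    /** Placeholder for the block locations and other metadata of one file. */
    public record FileMeta(String path, List<String> dataNodeAddresses) {}

    /** Placeholder for the call that returns metadata for a file and its correlated files. */
    public interface NameNodeClient {
        Map<String, FileMeta> getMetaWithCorrelated(String path);
    }

    private final Map<String, FileMeta> cache = new HashMap<>();
    private final NameNodeClient nameNode;

    public ClientMetadataCache(NameNodeClient nameNode) {
        this.nameNode = nameNode;
    }

    /** Returns metadata from the local cache when possible, otherwise fetches and prefetches. */
    public FileMeta getMeta(String path) {
        FileMeta cached = cache.get(path);
        if (cached != null) {
            return cached; // go straight to the DataNodes, no NameNode round trip
        }
        // Miss: fetch metadata for the file and its correlated files in one request.
        cache.putAll(nameNode.getMetaWithCorrelated(path));
        return cache.get(path);
    }
}
```

The point of prefetching the correlated files' metadata in the same round trip is that subsequent requests for those files hit the local cache, which is where the claimed reduction in metadata-request preparation time comes from.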
Keywords/Search Tags: HDFS, Small Files Problem, Client Behavior Analysis, Association Rules Mining