
Research And Implementation Of Distributed Storage Based On HDFS

Posted on: 2015-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: K Shu
GTID: 2308330473953084
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the information society, the Internet has seen an explosion of data. Large-scale data production is inevitably accompanied by massive data storage, but traditional storage systems can hardly scale beyond their limits. Distributed storage systems emerged to meet this need.

Hadoop is a distributed computing framework developed by the Apache Software Foundation and is now widely used in IT companies. MapReduce and HDFS are the core of Hadoop; they provide computing and storage services for data. HDFS, the Hadoop Distributed File System, is an open-source implementation of GFS, which was designed by Google. Its structure is therefore almost the same as that of GFS: both follow a master-slave model. As Hadoop is used more and more widely, and because of the storage performance of HDFS, many companies and research institutions are now studying HDFS. HDFS still has defects, however, and is being improved continuously.

This thesis analyzes the structure and mechanisms of HDFS, points out its design defects, and improves its replication strategy. The main work is as follows:

(1) The default static replication redundancy strategy of HDFS cannot recognize hot data, so the nodes storing that data become a bottleneck of the cluster. This thesis designs a replication redundancy strategy based on the heat of data. The strategy calculates and predicts the access pattern of each file. The statistical cycle differs from file to file and changes with the file's access frequency, so the strategy can quickly reflect changes in each file's heat and add or remove replicas accordingly. Using this strategy can shorten the system's response time, increase the cluster's throughput, and reduce the operation time.

(2) HDFS does not consider the heterogeneity of DataNodes. If a node of poor performance stores more data, it must bear more load in later reads, while high-performance nodes leave their computing power idle, which causes an unbalanced load distribution. To address this problem, this thesis proposes a placement strategy based on performance evaluation and the network distance of nodes. It provides an interface that lets users customize each node's load information and weights. It then uses an improved TOPSIS to evaluate nodes dynamically, and finally selects nodes according to the evaluation and the network distance. This strategy lets users customize what they care about, balances the load of each node, and improves overall system performance.

(3) Extensive simulations and experiments are carried out, and a cloud storage system based on the C/S model is developed on the improved HDFS cluster. A performance comparison of the different strategies shows that the improved strategies in this thesis enhance the cluster's performance.
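The heat-based replication idea in item (1) can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything here is an assumption made for illustration: the `FileHeat` class, exponential smoothing as the prediction step, the halving and doubling of the statistical cycle, and the mapping from predicted read rate to replica count. Only the lower bound of 3 replicas reflects the HDFS default.

```python
from dataclasses import dataclass

MIN_REPLICAS = 3   # HDFS default replication factor
MAX_REPLICAS = 10  # assumed upper bound for this sketch
ALPHA = 0.5        # assumed smoothing weight for the newest observation


@dataclass
class FileHeat:
    """Per-file access statistics (hypothetical structure, not from the thesis)."""
    window_s: float = 60.0        # current statistical cycle, in seconds
    accesses: int = 0             # reads seen in the current cycle
    predicted_rate: float = 0.0   # smoothed reads per second
    replicas: int = MIN_REPLICAS

    def record_access(self) -> None:
        self.accesses += 1

    def close_cycle(self, reads_per_replica: float = 1.0) -> int:
        """End the current cycle, update the prediction, return the replica target."""
        observed = self.accesses / self.window_s
        # Exponential smoothing stands in for the thesis's prediction step.
        self.predicted_rate = ALPHA * observed + (1 - ALPHA) * self.predicted_rate
        # A rising access rate shortens the cycle so that changes in heat show
        # up sooner; a flat or falling rate lengthens it.
        if observed > self.predicted_rate:
            self.window_s = max(10.0, self.window_s / 2)
        else:
            self.window_s = min(600.0, self.window_s * 2)
        # Assumed mapping: one extra replica per `reads_per_replica` reads/second.
        wanted = MIN_REPLICAS + int(self.predicted_rate / reads_per_replica)
        self.replicas = max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))
        self.accesses = 0
        return self.replicas
```

In a real cluster the returned target would still have to be applied, for example through the existing per-file replication setting; that step is outside this sketch.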
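The node evaluation in item (2) can be sketched with textbook TOPSIS. This is the standard method, not the improved variant the thesis describes, whose details the abstract does not give. The metric names, the `distance` field (an assumed hop count to the writer), and the `distance_penalty` used to combine score and distance are all hypothetical.

```python
import math


def topsis(matrix, weights, benefit):
    """Standard TOPSIS: return one closeness score in [0, 1] per row (higher is better).

    matrix  -- one row of metric values per node
    weights -- user-chosen weight per metric
    benefit -- True if larger is better for that metric, False if smaller is better
    """
    n = len(weights)
    # Vector-normalise each column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, ideal)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.5)
    return scores


def pick_nodes(nodes, weights, benefit, k, distance_penalty=0.1):
    """Pick k node names by TOPSIS score minus an assumed per-hop penalty."""
    scores = topsis([n["metrics"] for n in nodes], weights, benefit)
    ranked = sorted(
        zip(nodes, scores),
        key=lambda pair: pair[1] - distance_penalty * pair[0]["distance"],
        reverse=True,
    )
    return [node["name"] for node, _ in ranked[:k]]


# Hypothetical metrics per node: [free CPU share, free disk share, current load].
nodes = [
    {"name": "A", "metrics": [0.9, 0.8, 0.1], "distance": 2},
    {"name": "B", "metrics": [0.2, 0.3, 0.9], "distance": 0},
    {"name": "C", "metrics": [0.6, 0.6, 0.4], "distance": 2},
]
weights = [0.4, 0.3, 0.3]
benefit = [True, True, False]
chosen = pick_nodes(nodes, weights, benefit, k=2)
```

Here node A is best on every metric and node B worst, so A scores 1.0 and B scores 0.0; B's shorter distance is not enough to outweigh its poor evaluation under the assumed penalty, and A and C are chosen.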
Keywords/Search Tags: HDFS, distributed storage, dynamic replication, replication placement