Research And Implementation Of Distributed Storage Based On HDFS

Posted on:2015-12-02

Degree:Master

Type:Thesis

Country:China

Candidate:K Shu

Full Text:PDF

GTID:2308330473953084

Subject:Computer application technology

Abstract/Summary:


With the rapid development of the information society, the Internet has seen an explosion of data. Large-scale data production is inevitably accompanied by massive storage demands, and traditional storage systems are reaching their limits. Distributed storage systems emerged in response.

Hadoop is a distributed computing framework developed by the Apache Software Foundation and is now widely used in IT companies. MapReduce and HDFS are the core of Hadoop; they provide computing and storage services for data. HDFS, the Hadoop Distributed File System, is an open-source implementation of GFS, the file system designed by Google. Its structure is therefore almost the same as that of GFS: both follow a master-slave model. As Hadoop is adopted more widely and attention turns to the storage performance of HDFS, many companies and research institutions are studying HDFS. HDFS still has defects, however, and is being improved continuously.

This thesis analyzes the structure and mechanisms of HDFS, points out its design defects, and improves its replication strategy. The main work is as follows:

(1) The default static replication redundancy strategy of HDFS cannot recognize hot data, so the nodes storing that data become a bottleneck for the cluster. This thesis designs a replication redundancy strategy based on the heat of data. The strategy calculates and predicts the access pattern of each file. The statistical cycle differs from file to file and changes with the file's access frequency, so the strategy can quickly reflect changes in the heat of each file and add or remove replicas accordingly. Using this strategy can shorten the system's response time, increase the cluster's throughput, and reduce operation time.

(2) HDFS does not consider the heterogeneity of DataNodes. If a poorly performing node stores more data, it must bear more load in later read operations while high-performance nodes leave their computing power idle, which causes an unbalanced load distribution. To address this problem, this thesis proposes a placement strategy based on performance evaluation and the network distance of nodes. It provides an interface that lets users customize the load information of nodes and the corresponding weights. It then uses an improved TOPSIS method to evaluate nodes dynamically, and finally selects nodes according to the evaluation and the network distance. This strategy lets users specify the criteria they care about, balances the load across nodes, and improves overall system performance.

(3) Extensive simulations and experiments are carried out, and a cloud storage system with a client/server (C/S) architecture is developed on the improved HDFS cluster. A comparison of performance under the different strategies shows that the improved strategies proposed in this thesis enhance the cluster considerably.
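To make the placement idea in item (2) more concrete, the following is a minimal Python sketch of the classical TOPSIS ranking, not the improved variant developed in the thesis, whose details the abstract does not give. The criteria (free disk space and CPU load), the weights, and the helper `pick_nodes`, which breaks ties by network distance, are hypothetical and chosen only for illustration.

```python
import math


def topsis_scores(matrix, weights, benefit):
    """Classical TOPSIS closeness scores, one per row (candidate node).

    matrix  -- rows are nodes, columns are criteria (hypothetical metrics)
    weights -- one non-negative weight per criterion
    benefit -- per criterion: True if larger is better, False if smaller is better
    """
    n_criteria = len(weights)
    total_weight = sum(weights)
    # Vector-normalise each column (guarding against an all-zero column).
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_criteria)]
    # Apply the weights, normalised so that they sum to 1.
    weighted = [[(weights[j] / total_weight) * row[j] / norms[j]
                 for j in range(n_criteria)] for row in matrix]
    # Ideal and anti-ideal points depend on the direction of each criterion.
    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, ideal)    # distance to the ideal point
        d_worst = math.dist(row, worst)   # distance to the anti-ideal point
        denom = d_best + d_worst
        # Closeness is in [0, 1]; higher means closer to the ideal point.
        scores.append(d_worst / denom if denom else 0.5)
    return scores


def pick_nodes(scores, distances, k):
    """Hypothetical selection rule: highest score first, nearer node on ties."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], distances[i]))
    return order[:k]


if __name__ == "__main__":
    # Assumed metrics per node: [free disk in GB (benefit), CPU load (cost)].
    nodes = [[500, 0.2], [100, 0.9], [300, 0.5]]
    scores = topsis_scores(nodes, weights=[0.6, 0.4], benefit=[True, False])
    print(pick_nodes(scores, distances=[2, 0, 4], k=2))
```

In this sketch the user-supplied weights play the role of the customizable concern described above: raising the weight of a criterion pulls the ranking toward nodes that do well on it. How the thesis actually combines the evaluation score with network distance is not stated in the abstract, so the tie-breaking rule here is only one possible choice.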

Keywords/Search Tags:

HDFS, Distributed storage, dynamic replication, replication placement


Related items

1	Research And Optimization Of Distributed Storage Based On HDFS
2	The Research And Implementation Of Replication Management In HDFS
3	The Research Of Node’s Status-based Distributed File System Storage Replication Distribution
4	The Multi-tenant Replication Data Resources Management Mechanism In Saas
5	Research On File Accessing Performance Optimization Based On HDFS
6	Research On Data Partition And Replication For Online Social Network Storage Systems
7	Dynamic replication in wide area environments using message logging
8	Research On Theory And Methods Of Data Placement Optimization In Distributed Storage
9	Study Of Data Replication Technology In Cloud Storage Environment
10	Research On Dynamic Replication Strategy In Cloud Storage