
Research And Design Of High Resilience Solution In HDFS

Posted on: 2016-10-14  Degree: Master  Type: Thesis
Country: China  Candidate: G W Yuan  Full Text: PDF
GTID: 2308330470466152  Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of data-intensive applications, data volumes of all kinds have grown enormously over the first 15 years of this century. Traditional storage systems are increasingly unable to handle the processing and management of such huge amounts of data, and distributed computing frameworks have become the suitable approach. Hadoop is an implementation of a large-scale distributed computing framework with high throughput, high reliability, and high scalability, and it is therefore widely used in big data processing and storage. The Hadoop Distributed File System (HDFS), as a fundamental component of Hadoop, provides high-performance data storage and management services for Hadoop.

HDFS is designed to separate real data from metadata and adopts a master-slave architecture, in which a single master node manages the entire set of metadata for the data stored in HDFS. However, this architecture introduces a single point of failure (SPOF): if the master node of HDFS fails, the whole system becomes unavailable.

To address the SPOF problem, we analyzed several popular high-availability solutions in the industry and, after gaining a thorough understanding of the HDFS architecture, we propose a high-availability solution that implements a hot-standby mechanism in HDFS. This solution solves the SPOF problem in Hadoop while adding little overhead to HDFS. Based on the original Hadoop HDFS architecture, we introduced a hot-standby Namenode. As the system runs, metadata is synchronized continuously between the primary master and the standby node to maintain the consistency of the namespace. To reduce failover switching time, the standby imports only the edit log (Edits), so the system can switch to the standby Namenode rapidly when the active Namenode becomes unavailable.

Besides the SPOF problem, there is another problem in HDFS.
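The hot-standby mechanism described above can be illustrated with a minimal sketch. This is a hypothetical model, not Hadoop's actual API: the class name, edit-log format, and methods are all assumptions made for illustration. The key point it shows is that because the standby continuously applies the active node's edit log, failover only needs to replay the few entries not yet applied.

```python
# Hypothetical sketch of a hot-standby Namenode (NOT Hadoop's real API):
# the standby mirrors the active node's namespace by continuously applying
# its edit log, so failover only replays the short tail of unapplied edits.

class StandbyNamenode:
    def __init__(self):
        self.namespace = {}   # path -> metadata, mirrors the active node
        self.applied = 0      # number of edit-log entries already applied

    def sync(self, edit_log):
        """Apply any edit-log entries not yet seen by this standby."""
        for op, path, meta in edit_log[self.applied:]:
            if op == "create":
                self.namespace[path] = meta
            elif op == "delete":
                self.namespace.pop(path, None)
            self.applied += 1

    def failover(self, edit_log):
        """Catch up on the remaining edits, then take over as active."""
        self.sync(edit_log)   # only the unapplied tail, so this is fast
        return self.namespace
```

Because `sync` runs continuously while both nodes are up, the tail replayed inside `failover` is short, which is exactly why importing only the Edits keeps the switching time low.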
The access rate of data blocks varies: some blocks are accessed frequently and others rarely. With a fixed replication factor for all data blocks, the resources of the cluster cannot be used efficiently. After analyzing the block storage scheme in HDFS and the parity check code algorithm, we therefore propose an optimized block storage solution based on parity check codes, and we applied this algorithm to HDFS. Compared with the replication factor of 3 recommended by Apache, our solution not only provides high availability for Datanodes in HDFS but also markedly improves the utilization of HDFS storage. In this way, we improve the utilization of the cluster and save substantial resources, including hardware, electricity, and maintenance costs.
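The storage saving of parity check codes over triple replication can be illustrated with the simplest case, single XOR parity (as used in RAID-5). This is an illustrative sketch of the general idea, not the thesis's exact algorithm: with k data blocks protected by one parity block, the storage overhead is (k+1)/k rather than the 3x of replication, and any single lost block can be rebuilt from the survivors.

```python
# Illustrative sketch of the parity-check idea (simple XOR parity, as in
# RAID-5) -- not the thesis's exact scheme. k data blocks are protected by
# one parity block, so storage overhead is (k+1)/k instead of 3x, and any
# single lost block is recoverable.

def make_parity(blocks):
    """XOR equal-length data blocks byte by byte into one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover(surviving_blocks, parity):
    """Rebuild a single lost block: XOR of all survivors and the parity."""
    return make_parity(list(surviving_blocks) + [parity])
```

For example, with k = 3 blocks the total stored data is 4 blocks instead of the 9 that triple replication would require, while still tolerating the loss of any one block.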
Keywords/Search Tags: high availability, HDFS, Namenode, Datanode, failover