
Research And Design Of High Resilience Solution In HDFS

Posted on: 2016-10-14  Degree: Master  Type: Thesis
Country: China  Candidate: G W Yuan  Full Text: PDF
GTID: 2308330470466152  Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of data-intensive applications, data volumes of all kinds have grown enormously over the first 15 years of this century. Traditional storage systems are increasingly unable to handle the processing and management of such huge amounts of data, and distributed computing frameworks have become the suitable approach. Hadoop is an implementation of a large-scale distributed computing framework with high throughput, high reliability, and high scalability, and it is therefore widely used in big data processing and storage. The Hadoop Distributed File System (HDFS), as a fundamental component of Hadoop, provides high-performance data storage and management services for Hadoop.

HDFS is designed to separate real data from metadata and adopts a master-slave architecture, in which a single master node manages the entire set of metadata for the data stored in HDFS. However, this architecture introduces a single point of failure (SPOF): if the master node of HDFS fails, the whole system becomes unavailable.

To address the SPOF problem, we analyzed several popular high-availability solutions in the industry and, after gaining a thorough understanding of the HDFS architecture, we propose a high-availability solution that implements a hot-standby mechanism in HDFS. This solution solves the SPOF problem in Hadoop while adding little overhead to HDFS. Based on the original Hadoop HDFS architecture, we introduced a hot-standby Namenode. As the system runs, metadata is synchronized continuously between the primary master and the standby node to maintain the consistency of the namespace. To reduce failover switching time, the standby imports only the edit log (Edits), so the system can switch to the standby Namenode rapidly when the active Namenode becomes unavailable.

Besides the SPOF problem, there is another problem in HDFS.
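The hot-standby mechanism described above can be illustrated with a minimal sketch. This is a hypothetical model, not Hadoop's actual API: the class name, edit-log format, and methods are all assumptions made for illustration. The key point it shows is that because the standby continuously applies the active node's edit log, failover only needs to replay the few entries not yet applied.

```python
# Hypothetical sketch of a hot-standby Namenode (NOT Hadoop's real API):
# the standby mirrors the active node's namespace by continuously applying
# its edit log, so failover only replays the short tail of unapplied edits.

class StandbyNamenode:
    def __init__(self):
        self.namespace = {}   # path -> metadata, mirrors the active node
        self.applied = 0      # number of edit-log entries already applied

    def sync(self, edit_log):
        """Apply any edit-log entries not yet seen by this standby."""
        for op, path, meta in edit_log[self.applied:]:
            if op == "create":
                self.namespace[path] = meta
            elif op == "delete":
                self.namespace.pop(path, None)
            self.applied += 1

    def failover(self, edit_log):
        """Catch up on the remaining edits, then take over as active."""
        self.sync(edit_log)   # only the unapplied tail, so this is fast
        return self.namespace
```

Because `sync` runs continuously while both nodes are up, the tail replayed inside `failover` is short, which is exactly why importing only the Edits keeps the switching time low.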
The access rate of data blocks varies: some blocks are accessed frequently and others rarely. With a fixed replication factor for all data blocks, the resources of the cluster cannot be used efficiently. After analyzing the block storage scheme in HDFS and the parity check code algorithm, we therefore propose an optimized block storage solution based on parity check codes, and we applied this algorithm to HDFS. Compared with the replication factor of 3 recommended by Apache, our solution not only provides high availability for Datanodes in HDFS but also markedly improves the utilization of HDFS storage. In this way, we improve the utilization of the cluster and save substantial resources, including hardware, electricity, and maintenance costs.
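The storage saving of parity check codes over triple replication can be illustrated with the simplest case, single XOR parity (as used in RAID-5). This is an illustrative sketch of the general idea, not the thesis's exact algorithm: with k data blocks protected by one parity block, the storage overhead is (k+1)/k rather than the 3x of replication, and any single lost block can be rebuilt from the survivors.

```python
# Illustrative sketch of the parity-check idea (simple XOR parity, as in
# RAID-5) -- not the thesis's exact scheme. k data blocks are protected by
# one parity block, so storage overhead is (k+1)/k instead of 3x, and any
# single lost block is recoverable.

def make_parity(blocks):
    """XOR equal-length data blocks byte by byte into one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover(surviving_blocks, parity):
    """Rebuild a single lost block: XOR of all survivors and the parity."""
    return make_parity(list(surviving_blocks) + [parity])
```

For example, with k = 3 blocks the total stored data is 4 blocks instead of the 9 that triple replication would require, while still tolerating the loss of any one block.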
Keywords/Search Tags: high availability, HDFS, Namenode, Datanode, failover