Research And Implementation On The Distributed Storage System Based On HDFS

Posted on: 2017-05-23 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Cui | Full Text: PDF
GTID: 2308330485984546 | Subject: Computer application technology
Abstract/Summary:
With the advent of the big data era, traditional storage technology can no longer meet the growing demand for storage, and distributed storage systems have emerged in response. HDFS is Apache Hadoop’s distributed storage system. It runs on large clusters of low-cost servers, provides high fault tolerance, and is optimized for high-throughput access to large data sets. Nevertheless, the design of HDFS has drawbacks. This thesis studies the HDFS distributed storage model and addresses two of its weaknesses, poor scalability and high write latency. The main work and results are as follows:

(1) A distributed NameNode strategy with dynamic load balancing. HDFS uses a single metadata server (the NameNode) to manage metadata. Although this design is simple and efficient to implement, it has three drawbacks: poor system scalability, low metadata availability, and poor isolation. The existing HDFS Federation and NCUC strategies both still have a single point of failure, and neither provides dynamic load balancing. This thesis presents a distributed NameNode strategy with dynamic load balancing to solve this problem. In the proposed strategy, metadata is stored in the NameNode cluster as multiple replicas; metadata distribution takes into account both the performance differences between heterogeneous servers and their current load; when the load across NameNodes changes, the strategy triggers dynamic load balancing; the client caches metadata to reduce access time; and when a NameNode fails or a metadata replica is lost, the strategy automatically starts metadata recovery.

(2) A delayed adaptive replica synchronization strategy. HDFS synchronizes replicas with a simple, strongly consistent strategy. Although this design always keeps the replicas in a consistent state, it results in low write throughput and high write latency, so HDFS is not well suited to workloads with high write-performance requirements. Existing solutions have drawbacks of their own: the dynamic replica synchronization strategy requires the NameNode to participate in replica synchronization, and the Quorum strategy has relatively poor read performance. This thesis proposes a delayed adaptive replica synchronization strategy to solve this problem. During a write operation, the strategy synchronizes only a subset of the replicas immediately and synchronizes the remaining replicas later through delayed adaptive synchronization, which improves write performance. By using a BlockList structure, the strategy does not require the NameNode to participate in replica synchronization, and it achieves better read performance than the Quorum strategy.

(3) HDFS is improved by applying both strategies proposed in this thesis: the distributed NameNode strategy with dynamic load balancing and the delayed adaptive replica synchronization strategy. Together they address HDFS’s poor scalability, low metadata availability, and high write latency.
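The write path described in (2) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the thesis's implementation: the names `Replica`, `DelayedSyncWriter`, `sync_count`, and the `pending` queue are hypothetical, the queue is a simple stand-in for whatever the BlockList structure actually records, and the adaptive choice of when to run the delayed synchronization is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for a DataNode holding block replicas."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


class DelayedSyncWriter:
    """Toy model: update a few replicas synchronously, defer the rest."""

    def __init__(self, replicas, sync_count):
        if not 1 <= sync_count <= len(replicas):
            raise ValueError("sync_count must be between 1 and len(replicas)")
        self.replicas = replicas
        self.sync_count = sync_count
        # Deferred updates as (replica, block_id, data), applied in FIFO order.
        self.pending = deque()

    def write(self, block_id, data):
        # Synchronous part: the write is acknowledged once these are updated.
        for replica in self.replicas[:self.sync_count]:
            replica.blocks[block_id] = data
        # Deferred part: the remaining replicas are updated later by flush().
        for replica in self.replicas[self.sync_count:]:
            self.pending.append((replica, block_id, data))

    def up_to_date(self, block_id):
        # Replicas with a queued update for this block are stale; skip them.
        stale = {r.name for r, b, _ in self.pending if b == block_id}
        return [r for r in self.replicas
                if r.name not in stale and block_id in r.blocks]

    def flush(self, limit=None):
        # Apply up to `limit` deferred updates (all of them if limit is None).
        applied = 0
        while self.pending and (limit is None or applied < limit):
            replica, block_id, data = self.pending.popleft()
            replica.blocks[block_id] = data
            applied += 1
        return applied
```

In this sketch a reader would call `up_to_date(block_id)` to pick a replica that already holds the latest data, which mirrors the idea that reads can avoid stale copies without consulting the NameNode. How the thesis decides the number of synchronous replicas and the timing of the delayed pass is not stated in the abstract.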
Keywords/Search Tags: HDFS, distributed storage system, metadata management, replica synchronization, consistency problem