
Research Of Repair And Response Of Failed Data In Distributed Storage Systems

Posted on: 2019-11-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J T Fang
GTID: 1368330563490918
Subject: Computer system architecture
Abstract/Summary:
In data centers, distributed storage systems consisting of hundreds to thousands of storage nodes are widely deployed as the infrastructure of large-scale Internet services such as web search, electronic commerce, and social networking. At this scale, node failures are the norm rather than the exception. To prevent data unavailability or data loss caused by node failures, data redundancy techniques are widely adopted. When nodes fail, redundant data is used to repair lost data or to reconstruct unavailable data for user requests. These two procedures affect not only data reliability but also serviceability, so they have important research value and high practical value.

In traditional failure identification, identifying failed chunks relies on identifying failed nodes. Risk-Aware Failure Identification (RAFI) is proposed. RAFI identifies chunk failures according to the risk level of their host stripes, which is determined by the number of failed chunks in each stripe. For chunks in high-risk stripes, short failure identification time thresholds are adopted, thus improving data reliability and availability. For chunks in low-risk stripes, long failure identification time thresholds are adopted, thus reducing repair network traffic. Experimental results show that RAFI simultaneously improves reliability, availability, and serviceability (RAS). For example, in an RS(6,3)-coded storage cluster with 1000 storage nodes, in the best cases, RAFI can improve reliability by a factor of 11 and reduce unavailability and repair network traffic by 45% and 28%, respectively.

Traditional failure identification also uses a fixed check interval to identify failures. Adaptive check intervals (ACI) are proposed. When failed chunks are found, ACI shortens the check interval to expedite the identification of failed chunks in high-risk stripes, thus improving data reliability. Otherwise, ACI uses a longer check interval to mitigate the computational cost of checks on the manager node, thus improving serviceability. Both simulation and experimental results show that, in a 3-replica storage cluster with 1000 storage nodes, in the best cases, ACI combined with RAFI can further improve reliability by a factor of 3.2. Meanwhile, the computational time caused by checks on the manager node increases by 18%.

Degraded reads in Reed-Solomon storage clusters suffer from high latency. Degraded reads with parallel reconstruction (DRPR) is proposed. DRPR chooses an under-loaded storage node as the starter node and utilizes more available source nodes to increase the network bandwidth for transmitting data in degraded reads, thus reducing the latency of degraded reads. Prototype-based experimental results show that, compared to state-of-the-art solutions, DRPR can reduce the latency of degraded reads by 10% in most cases.
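The two failure identification ideas in the abstract, RAFI and ACI, can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The function names, the two-level risk split, and every numeric constant are assumptions made for illustration, since the abstract gives no concrete thresholds or intervals.

```python
from __future__ import annotations

# All constants below are illustrative assumptions, not values from the dissertation.
HIGH_RISK_MIN_UNREACHABLE = 2   # assumed: 2 or more unreachable chunks marks a high-risk stripe
SHORT_THRESHOLD_S = 60.0        # assumed identification threshold for high-risk stripes
LONG_THRESHOLD_S = 1800.0       # assumed identification threshold for low-risk stripes
SHORT_CHECK_INTERVAL_S = 5.0    # assumed check interval after failed chunks are found
LONG_CHECK_INTERVAL_S = 60.0    # assumed check interval when nothing has failed


def identification_threshold(unreachable_chunks: int) -> float:
    """RAFI idea: the more chunks a stripe is missing, the shorter the wait."""
    if unreachable_chunks >= HIGH_RISK_MIN_UNREACHABLE:
        return SHORT_THRESHOLD_S
    return LONG_THRESHOLD_S


def identify_failed_chunks(unreachable_for_s: dict[int, float]) -> list[int]:
    """Return the chunk ids of one stripe that should be declared failed.

    `unreachable_for_s` maps each currently unreachable chunk id to the
    number of seconds it has been unreachable.
    """
    threshold = identification_threshold(len(unreachable_for_s))
    return sorted(
        chunk_id
        for chunk_id, seconds in unreachable_for_s.items()
        if seconds >= threshold
    )


def next_check_interval(found_failed_chunks: bool) -> float:
    """ACI idea: check sooner after finding failures, relax otherwise."""
    if found_failed_chunks:
        return SHORT_CHECK_INTERVAL_S
    return LONG_CHECK_INTERVAL_S
```

Under these assumed constants, `identify_failed_chunks({0: 120.0})` returns `[]`, because a stripe with a single unreachable chunk is low risk and waits for the long threshold. `identify_failed_chunks({0: 120.0, 3: 30.0})` returns `[0]`, because the second unreachable chunk makes the stripe high risk and switches it to the short threshold.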
Keywords/Search Tags:Distributed storage systems, Erasure codes, Reliability, Degraded read, Availability, Serviceability, Replication