The rapid development of internet applications drives exponential growth in data volume, placing ever higher demands on the capacity and performance of storage systems. Ensuring high reliability is fundamental to any storage system. In general, replication and erasure coding are the two common redundancy strategies for providing fault tolerance. Replication ensures high reliability by keeping multiple copies of the data. Compared with replication, erasure coding (EC) provides high fault tolerance at low storage cost, but its read/write, degraded-read, and failure-recovery paths require substantial cross-node data transmission and encoding/decoding computation, so the network and CPU often become performance bottlenecks. In general, erasure-coded storage systems are evaluated on multiple metrics, including fault tolerance, read/write performance, degraded-read performance, and recovery performance.

This dissertation focuses on the design and optimization of erasure-coding workflows for highly reliable storage systems under different application scenarios and requirements, so as to meet the systems' key performance targets. Specifically, it covers a data layout design for erasure-coded storage systems that optimizes failure-recovery performance, a recovery task scheduling design for erasure-coded storage systems that balances recovery load, and an erasure-coding workflow design for disaggregated memory systems that delivers high reliability and high performance. The main research contents and contributions are as follows:

(1) Research on a data layout design for erasure-coded storage systems

In distributed storage systems (DSSes), a random data layout is commonly used to balance storage, but traditional random data placement induces massive cross-rack traffic and imbalanced load during batched failure recovery, which degrades recovery performance significantly. In addition, various erasure codes coexisting in a DSS
exacerbate the above problems. To address this, we propose PDL, a uniform data layout that optimizes failure-recovery performance in DSSes. PDL is constructed from Pairwise Balanced Design, a combinatorial design scheme with uniform mathematical properties, and therefore yields a uniform data layout even when multiple erasure codes are mixed. On top of PDL, we propose rPDL, a failure-recovery scheme that effectively reduces cross-rack traffic and distributes it almost evenly by uniformly choosing replacement nodes and retrieving deterministically selected available blocks to reconstruct the lost ones. We implement PDL and rPDL in HDFS 3. Compared with the existing data layout and recovery scheme in HDFS, experimental results show that rPDL achieves much higher recovery throughput: 6.27× on average for single-node failures, 5.14× for multi-node failures, and 1.48× for single-rack failures. It also reduces degraded-read latency by 62.83% on average and better supports front-end applications under failures.

(2) Research on a recovery task scheduling design in erasure-coded storage systems

Erasure coding is widely used to offer high data reliability at low storage cost. Upon failures, the lost blocks are recovered in batches. Because the number of stripes per batch is limited, the data layout within a batch is non-uniform. Together with the random selection of source and replacement nodes for recovery tasks, this skews the recovery load among surviving nodes within a batch, which severely slows down failure recovery. To solve this problem, we present SelectiveEC, a new recovery task scheduling module that provides provable network-traffic and recovery-load balancing for large-scale erasure-coded storage systems. SelectiveEC models the recovery traffic among surviving nodes as bipartite graphs. It then selects recovery tasks to form batches and carefully determines where to read source blocks and where to store recovered ones, using the theory of perfect or maximum matching
and k-regular spanning subgraphs. SelectiveEC supports both single-node and multi-node failure recovery, and can be deployed in homogeneous as well as heterogeneous network environments. We implement SelectiveEC in HDFS 3 and evaluate its recovery performance on a local cluster of 18 nodes and on AWS EC2 with 50 virtual machine instances. SelectiveEC increases recovery throughput by up to 30.68% over state-of-the-art baselines in homogeneous network environments. In heterogeneous network environments, it further achieves 1.32× the recovery throughput and 1.23× the benchmark throughput of HDFS on average, because its balanced scheduling avoids stragglers.

(3) Research on a workflow design of erasure coding in disaggregated memory architecture

In disaggregated memory systems, erasure coding can provide high reliability at low memory cost. However, as the latency of one-sided RDMA drops to the microsecond level, coding computation, rather than the expensive network and disk I/O of traditional storage systems, becomes the new bottleneck of EC in disaggregated memory. To enable efficient EC in disaggregated memory, we first derive three key insights to guide the design by thoroughly analyzing the workflows of coding and RDMA transmission. We then develop MicroEC, which redesigns the coding stack with cache optimizations and leverages an efficient pipeline to optimize RDMA transmission. We implement a prototype that supports general operations such as write, read, degraded read, and recovery. Experiments show that MicroEC significantly reduces coding latency to match the low latency of one-sided RDMA, especially for objects larger than 1 MB. It also achieves up to 2.08× and 1.74× the write throughput of state-of-the-art EC and 3-way replication policies, respectively.
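The fault-tolerance-versus-storage-cost trade-off of erasure coding described above can be illustrated with a minimal sketch. This is a didactic special case (a single XOR parity block, i.e., m = 1), not the codes used in the work above; production systems such as HDFS typically use Reed-Solomon codes to tolerate multiple failures.

```python
# Minimal EC illustration: one XOR parity block over k data blocks
# tolerates the loss of any single block while adding only 1/k extra
# storage, versus 2x extra for 3-way replication. Didactic sketch
# only; real systems use Reed-Solomon codes for multi-failure cases.

def encode(blocks: list[bytes]) -> bytes:
    """Return the XOR parity of equally sized blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving: list[bytes]) -> bytes:
    """Rebuild the single lost block: the XOR of all surviving
    blocks (data plus parity) equals the missing block."""
    return encode(surviving)

if __name__ == "__main__":
    stripe = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 data blocks
    parity = encode(stripe)                 # m = 1 parity block
    # Simulate losing block 1 and rebuild it from the survivors.
    rebuilt = recover([stripe[0], stripe[2], parity])
    assert rebuilt == stripe[1]
    print(rebuilt)  # b'BBBB'
```

Here the stripe stores 4/3 of the original data to survive one failure, while 3-way replication would store 3× as much; this gap is the storage saving that motivates EC throughout the dissertation.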
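The bipartite-matching formulation used by SelectiveEC can be sketched with a small stdlib-only example: recovery tasks on one side, candidate source nodes on the other, with an edge where a node holds a block the task needs; a maximum matching then serves each task from a distinct node so no surviving node is overloaded within a round. The task and node names below are hypothetical, and this sketch omits the batching, replacement-node selection, and k-regular spanning subgraph machinery of the actual scheduler.

```python
# Illustrative sketch of scheduling recovery tasks via bipartite
# maximum matching, in the spirit of SelectiveEC: each task gets a
# distinct source node within a scheduling round. Names are
# hypothetical; the real system also balances replacement nodes.

def max_matching(edges: dict[str, list[str]]) -> dict[str, str]:
    """Augmenting-path maximum bipartite matching.
    edges maps each task to the nodes that can serve it;
    returns a task -> node assignment of maximum size."""
    match: dict[str, str] = {}          # node -> task

    def try_assign(task: str, seen: set[str]) -> bool:
        for node in edges[task]:
            if node in seen:
                continue
            seen.add(node)
            # Node is free, or its current task can be re-routed.
            if node not in match or try_assign(match[node], seen):
                match[node] = task
                return True
        return False

    for task in edges:
        try_assign(task, set())
    return {task: node for node, task in match.items()}

if __name__ == "__main__":
    # Tasks t1..t3 and the nodes holding a needed source block.
    candidates = {
        "t1": ["n1", "n2"],
        "t2": ["n1"],
        "t3": ["n2", "n3"],
    }
    print(max_matching(candidates))
    # every task is served by a distinct node
```

A greedy assignment could stall (t1 taking n1 would leave t2 unservable), while the augmenting-path search re-routes t1 to n2, which is why matching theory, rather than greedy selection, underpins the load-balance guarantee.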
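The pipelining idea behind MicroEC's write path can likewise be sketched under simplifying assumptions: the object is split into chunks, and encoding of chunk i+1 overlaps with transmission of chunk i instead of encoding the whole object before any transmission starts. The RDMA send is mocked as a list append, and the chunk size and XOR "encoding" are placeholders, not MicroEC's actual parameters or coding routine.

```python
# Sketch of coding/transmission pipelining: a sender thread drains a
# queue (simulating RDMA sends of chunk i) while the main thread
# encodes chunk i+1. Chunk size and the XOR transform are
# illustrative placeholders, not MicroEC's real implementation.

import queue
import threading

CHUNK = 4  # bytes per pipeline chunk (illustrative)

def encode_chunk(chunk: bytes) -> bytes:
    # Placeholder for real coding work (e.g., Reed-Solomon).
    return bytes(b ^ 0xFF for b in chunk)

def pipelined_write(obj: bytes) -> list[bytes]:
    sent: list[bytes] = []
    q: queue.Queue = queue.Queue()

    def sender() -> None:  # simulated RDMA transmission stage
        while (chunk := q.get()) is not None:
            sent.append(chunk)

    t = threading.Thread(target=sender)
    t.start()
    for i in range(0, len(obj), CHUNK):    # encoding stage
        q.put(encode_chunk(obj[i:i + CHUNK]))
    q.put(None)                            # end-of-object marker
    t.join()
    return sent

if __name__ == "__main__":
    chunks = pipelined_write(b"\x00\x01\x02\x03\x04\x05")
    print(b"".join(chunks))  # b'\xff\xfe\xfd\xfc\xfb\xfa'
```

With per-chunk encode time e and send time s, the pipelined write finishes in roughly e + max(e, s) × (n − 1) + s for n chunks instead of n × (e + s), which is the latency reduction the abstract refers to when coding and RDMA transmission overlap.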