Research On Tail Latency Optimization For Cloud Storage Systems

Posted on: 2020-09-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Su
Full Text: PDF
GTID: 1368330590958852
Subject: Computer system architecture
Abstract/Summary:
With the development and maturity of cloud technology, more and more organizations, companies, and individuals have adopted cloud platforms. In cloud environments, the storage service is one of the core services. In recent years, tail latency optimization in cloud storage systems has drawn a lot of attention from both academia and industry. Given the large number of users and the complexity of applications (serving a single application request requires hundreds of data accesses), cutting the tail latency plays an important role in optimizing system performance. The tail latency is the latency of the slowest requests in the system, and it is usually measured by a high-percentile latency. This dissertation proposes approaches to tail latency optimization in cloud storage systems.

For cloud object storage systems that serve Web applications, this dissertation establishes an analytic performance model, named COSModel, that predicts the tail latency of the system from its request processing procedure. COSModel predicts the tail latency accurately and is robust to workload changes. Existing performance models are either simulation-based or analytic. With a simulation-based model, the system must be benchmarked under various workloads in advance, and the model then predicts system performance from the benchmark results. In contrast, COSModel abstracts and approximates the cloud object storage system as a queuing model and predicts the tail latency by solving that model. Compared with simulation-based models, COSModel is robust to workload changes because its predictions do not rely on workload-specific benchmark results. Existing analytic models ignore important factors that affect tail latency: serving a request requires data-locating and metadata-reading operations that may access storage devices; cloud object storage systems use an event-driven concurrent processing architecture; requests must wait before the storage server accepts them via accept(); and so on. COSModel takes the impact of these factors into account and therefore predicts the tail latency of cloud object storage systems more accurately than existing analytic models. Experimental results show that the average prediction error of COSModel is 2.63%, which reduces the prediction error of the baseline models by 90%.

Because data-intensive applications generally access data in a skewed and bursty way, this dissertation proposes a selective replication method for cloud object storage systems, named Lemming, that copes with bursty workloads. Lemming effectively leverages idle resources in the system to reduce the tail latency. Selective replication means determining the number of replicas of each data item according to its popularity, which can improve system performance under skewed workloads. Because changing the number of replicas requires data migration, frequent changes can have a significant negative impact on tail latency. Existing selective replication methods therefore consider only long-term changes in workloads. To cope with unpredictable short-term changes in bursty workloads, Lemming adopts an aggressive strategy for adjusting the number and layout of replicas in the system. With this strategy, Lemming tries to minimize the impact of bursty workloads on tail latency at the expense of migrating more data and migrating it more frequently. To reduce the performance overhead of creating extra replicas, Lemming proposes a 2-Phase Data Migration (2PDM) approach. With 2PDM, the data is first read from the source node (usually a heavily loaded node) to the client by piggybacking on a normal data access request, and is then migrated from the client to the destination node. Compared with migrating data directly from the source node to the destination node, 2PDM avoids extra data reads on the heavily loaded source node. Experimental results show that Lemming can reduce the mean response latency of the system by up to 95.4% and the 99th percentile latency by up to 98.0%.

For latency-sensitive distributed storage systems with strict requirements on response latency, this dissertation proposes a replica selection method, named NetRS, that improves the efficiency of replica selection and reduces the tail latency of the system. NetRS leverages emerging network devices, including programmable switches and network accelerators. In latency-sensitive storage systems, each client generally selects replicas for its own data read requests instead of relying on the traditional centralized replica selection method, which limits the latency overhead of replica selection. However, client-based replica selection usually leads to poor choices. This is because no client can obtain timely information about the global system state, and a large number of independent replica selection nodes can end up selecting the same replica server at the same time. NetRS uses the programmability of emerging network devices to offload replica selection from a large number of clients to a much smaller number of network devices. By significantly reducing the number of independent replica selection nodes in the system, NetRS effectively improves the efficiency of replica selection. To optimize the placement of replica selection nodes in the data center network, NetRS formalizes the placement problem and proposes heuristic algorithms that quickly find near-optimal placements. To keep its internals transparent to clients and servers and to support a variety of replica selection algorithms, NetRS uses a flexible packet format and customizes the packet processing pipeline of each network device. Experimental results show that NetRS can reduce the mean response latency of the system by up to 50.3% and the 99th percentile latency by up to 69.7%.
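
The mean and 99th percentile latencies reported above are summary statistics over per-request latency samples. As a minimal sketch, not the dissertation's measurement code, the following Python snippet computes both using the nearest-rank percentile definition. The function name `percentile` and the sample latencies are hypothetical and chosen only for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to limit float rounding.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (assumed data):
# 98 fast requests and 2 slow ones that form the tail.
latencies_ms = [10.0] * 98 + [500.0, 900.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms, p50={p50_ms:.1f} ms, p99={p99_ms:.1f} ms")
```

In this made-up sample the median stays at the fast-request latency while the 99th percentile lands on a slow request, which is why a high percentile, rather than the mean or median, is used to characterize the tail.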
Keywords/Search Tags: Cloud storage, Tail latency, Performance model, Selective replication, Replica selection