Research On Cloud System Optimization Based On Learning-augmented Design

Posted on: 2022-04-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Li
GTID: 1488306314455284
Subject: Computer software and theory
Abstract/Summary:
A learning-augmented system is a new design that introduces machine learning (ML) and deep learning (DL) into the traditional cloud system architecture. It aims to further improve the performance and adaptability of cloud systems beyond what traditional methods achieve. With the rapid development of Internet services and users' continuously rising service-quality requirements, the data workloads of cloud providers' services have become increasingly diverse. The complex and changeable system environment brings new challenges to performance optimization, such as dynamic parameter adjustment, routing, scheduling optimization, and load balancing. Even within the same framework, system states and optimization schemes differ across services and workloads. Traditional optimization schemes usually depend on expert knowledge of specific business scenarios, so they remain limited in portability, scalability, and adaptability.

In recent years, with the rapid progress of artificial intelligence research, ML and DL algorithms with strong abstract modeling ability have provided new ideas for system state modeling and performance optimization. By observing and collecting system behavior and data workloads, such models fit the system behavior and the data stream so that performance can be optimized more effectively. However, because of the inherent uncertainty of ML and DL models and the high availability requirements of the system, simply collecting historical data and applying ML and DL models to system scenarios rarely optimizes performance effectively. As a result, it is difficult to deploy ML and DL models in real production environments.

This dissertation aims to design a learning-augmented system toolchain for performance optimization. A series of studies is carried out to better coordinate system optimization task
scenarios with data collection, model inference, training combination, and model testing. First, according to system requirements in production, we study data exploration strategies and design specific optimization schemes. Then, we propose a general strategy for automatically combining models. Finally, we design a novel black-box model testing tool based on error distribution to help developers understand model behavior. The research in this dissertation consists mainly of the following four parts.

(1) Learning-augmented system design for parameter optimization. End-to-end parameter optimization is one of the classic system performance optimization tasks: system performance metrics can be optimized by adjusting the configuration. However, after each parameter adjustment, the system must start up and stabilize before a valid configuration sample can be collected. Moreover, performance metrics (such as throughput and latency) are susceptible to external factors, which results in noisy and inaccurate measurements. To address the high sampling cost and the noise in performance evaluation, we propose Metis, a robust parameter optimization service based on Bayesian Optimization (BO). We design an outlier detection mechanism and a combined acquisition function so that valid data can be collected actively in a noisy environment. Metis is validated on a real latency optimization task for the IDHash cache of BingAds. The results show that, for the same number of samples, our method finds a parameter combination with lower tail latency than the alternatives.

(2) Learning-augmented system design for priority scheduling. End-to-end real-time scheduling is another classic system performance optimization task. Evaluating real-time scheduling performance must account for both the cost of the scheduling algorithm and the cost of executing back-end tasks. Therefore, it is necessary to consider the trade-off between the two steps to achieve the
optimal effect. To solve this problem, we propose LearnedRanker, a sequence scheduling tool based on deep learning. The end-to-end latency is optimized through the scheduling priority given by the output weights and through automatic adjustment of the model structure. In addition, to optimize the performance of the scheduling model, we apply gradient-based active data generation and a pruning technique. We use regular expression (regex) rule matching systems as the scenario, aiming to achieve early termination and speed up rule checking. Our evaluation includes two real rule sets (CRS and Snort) and three popular regex engines (PCRE, RE2, and HyperScan). LearnedRanker is plugged into these engines to evaluate the end-to-end latency. According to the experimental results, LearnedRanker significantly reduces overall latency compared with the static algorithm and the traditional scheduling method.

(3) The design and operation of learning-augmented systems. Although ML and DL provide new possibilities for optimizing system design and performance, taking advantage of this paradigm shift requires more than implementing existing ML/DL algorithms. We report our years of experience in designing and operating several production learning-augmented systems at Microsoft. AutoSys is a framework that unifies the development process, and it addresses common design considerations including ad-hoc and nondeterministic jobs, learning-induced system failures, and programming extensibility. Furthermore, we demonstrate the benefits of adopting AutoSys with measurements from one production system, Web-Search. Finally, we share long-term lessons that stem from unforeseen implications that have surfaced over years of operating learning-augmented systems.

(4) Learning-augmented system model testing tool based on error distribution. When new system frameworks and versions are released and deployed, it is necessary to ensure the
stability and security of the system through repeated testing. However, a model in a learning-augmented system is almost a black box: it is complex, difficult to understand, and inherently uncertain. Potential wrong decisions can lead to performance degradation, system blocking, or even crashes, which hinders the deployment of learning-augmented systems in real production environments. Traditional model testing tools aim to generate samples that make the model's decisions go wrong, through adversarial generation or random perturbation. However, isolated generated samples are insufficient to cover and explain the whole data space. Therefore, we propose Tapio, a black-box model testing tool based on error distribution estimation, which quickly fits the global error distribution by minimizing the area of the maximum uncertainty region. The global performance of the tested model is thus presented to help developers fine-tune the model or add framework constraints. The goal is to improve system reliability and promote deployment in real production. Performance prediction models are tested in two different scenarios, RocksDB and Azure VM. The results verify that Tapio can effectively improve the throughput of RocksDB and the prediction accuracy for Azure VM.
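
As an illustration of the early-termination idea described in part (2), the following is a minimal sketch and not the LearnedRanker implementation. It shows only why the order in which regex rules are checked affects how many rules must be evaluated before the first match. The rule set, the `first_match` and `rank_by_score` helpers, and the priority scores are all hypothetical placeholders. In the dissertation the priorities are produced by a deep learning model, which is not reproduced here.

```python
import re
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical toy rule set; a real deployment would load CRS or Snort rules.
RULES = {
    "sql_injection": re.compile(r"union\s+select", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./"),
    "script_tag": re.compile(r"<script\b", re.IGNORECASE),
}


def first_match(payload: str, order: Sequence[str]) -> Tuple[Optional[str], int]:
    """Check rules in the given order and stop at the first hit.

    Returns the matching rule name (or None) and the number of rules evaluated.
    """
    evaluated = 0
    for name in order:
        evaluated += 1
        if RULES[name].search(payload):
            return name, evaluated  # early termination
    return None, evaluated


def rank_by_score(scores: Dict[str, float]) -> List[str]:
    """Order rule names by descending priority score (placeholder scores)."""
    return sorted(scores, key=scores.get, reverse=True)


payload = "GET /index.php?file=../../etc/passwd"

# A fixed, input-independent order versus an order driven by priority scores.
static_order = ["sql_injection", "script_tag", "path_traversal"]
placeholder_scores = {"sql_injection": 0.1, "path_traversal": 0.8, "script_tag": 0.1}
ranked_order = rank_by_score(placeholder_scores)

print(first_match(payload, static_order))  # matching rule found after 3 checks
print(first_match(payload, ranked_order))  # matching rule found after 1 check
```

In this toy setting the ranked order reaches the matching rule after one check instead of three. The trade-off noted in part (2) still applies: the time spent computing the priorities has to be smaller than the time saved on rule checking.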
Keywords/Search Tags: Learning-augmented System, Cloud System, Machine Learning, Performance Optimization, Performance Modeling