Performance Optimization Research Of Deep Learning Prediction Serving System Based On Hardware Partition Strategy

Posted on: 2022-03-16    Degree: Master    Type: Thesis
Country: China    Candidate: S J Li    Full Text: PDF
GTID: 2518306731487764    Subject: Computer Science and Technology
Abstract/Summary:
Deep learning is a newer research direction within machine learning, whose goal is to bring machines toward artificial intelligence by training on large amounts of data. Deep learning has become a practical way to solve many machine learning problems and has achieved widespread success in scenarios such as object detection, speech recognition, intelligent question answering, recommendation systems, and autonomous driving, with results far exceeding those of earlier techniques. Growing attention is being paid to deep learning research, not only to the design of deep learning algorithms and models but also to methods for building and training deep models enabled by improved hardware. The rise of deep learning has set off a new wave of artificial intelligence.

With this rise, research on deep learning prediction serving systems has also become a novel topic. Although a large body of literature and software is dedicated to training deep learning models, far less research addresses the serving system used in the model inference stage. Precisely because the purpose of model training is to solve practical problems during inference and prediction, systematic work on deploying models to provide online prediction services is emerging.

A deep learning prediction serving system is a serving system coupled with deep learning models. It is used in the deployment and inference stage: it makes model deployment simple and provides online prediction through model inference. The system must deliver robust, accurate, and low-latency inference and prediction services. It is a computation-centered system: every inference pass of a model consumes substantial hardware resources, such as the GPU, CPU cores, cache, and memory bandwidth. In the serving system, each model performs a computation for every prediction request, and the SLO (Service Level Objective) time required to respond differs across kinds of requests. Requests can be divided into two kinds: real-time tasks, which must return inference results immediately, and non-real-time tasks. Although it is preferable to answer every prediction request as quickly as possible, hardware resources are sometimes insufficient; meeting the SLO time requirements of different prediction tasks then calls for optimization strategies that let as many prediction requests as possible meet their SLO time limits.

This thesis studies optimization strategies for the inference and prediction stage of deep learning models, analyzes the application scenarios of various optimization strategies, and focuses on prediction latency. Combining resource isolation technology, we propose a hardware resource partition strategy for the prediction serving system that reduces interference between co-running models and thereby lowers the average latency of prediction requests. Using service-quality feedback, the strategy adjusts the hardware partition dynamically to maximize quality of service and improve hardware utilization.
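The feedback-driven repartitioning described above can be pictured as a small control loop: give each deployed model a share of the hardware, watch whether its requests meet their SLO, and move resources from models with headroom to models that are violating. The sketch below is a minimal illustration of that idea only, not the algorithm from the thesis; the names (`ModelPartition`, `rebalance`), the choice of CPU cores as the partitioned resource, and the p99-latency trigger are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class ModelPartition:
    """Hardware share assigned to one deployed model (illustrative)."""
    name: str
    slo_ms: float                  # target latency (SLO) for this model's requests
    cpu_cores: int                 # cores currently reserved for this model
    recent_latencies_ms: list = field(default_factory=list)

    def p99_latency(self) -> float:
        """Observed p99 latency over the recent window (0 if no samples)."""
        if not self.recent_latencies_ms:
            return 0.0
        xs = sorted(self.recent_latencies_ms)
        return xs[int(0.99 * (len(xs) - 1))]

def rebalance(partitions: list[ModelPartition]) -> None:
    """One step of the feedback loop: move a core from the model with the
    most SLO headroom to the model violating its SLO the worst."""
    violators = [p for p in partitions if p.p99_latency() > p.slo_ms]
    donors = [p for p in partitions
              if p.p99_latency() < p.slo_ms and p.cpu_cores > 1]
    if not violators or not donors:
        return
    worst = max(violators, key=lambda p: p.p99_latency() / p.slo_ms)
    best = min(donors, key=lambda p: p.p99_latency() / p.slo_ms)
    best.cpu_cores -= 1
    worst.cpu_cores += 1

# Example: two co-running models sharing 8 cores; the detector misses its SLO.
parts = [
    ModelPartition("detector", slo_ms=50, cpu_cores=4,
                   recent_latencies_ms=[62, 70, 55]),
    ModelPartition("recommender", slo_ms=200, cpu_cores=4,
                   recent_latencies_ms=[80, 90, 85]),
]
rebalance(parts)
print({p.name: p.cpu_cores for p in parts})  # {'detector': 5, 'recommender': 3}
```

A real serving system would partition the GPU, cache, and memory bandwidth as well as CPU cores, and would smooth the feedback signal before moving resources; the loop above only shows the shape of the SLO-feedback idea.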
Keywords/Search Tags: Deep Learning, Serving System, Hardware Partition, Inference Prediction