Machine Learning Platform Improvements For Tensorflow Distributed Training And High Performance Inference

Posted on: 2022-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Lu
Full Text: PDF
GTID: 2518306563960709
Subject: Software engineering
Abstract/Summary:
This thesis describes improvements to our internal machine learning platform for developers who work with TensorFlow models. The goals are to simplify information lookup, code conversion, and model conversion, and to make model training and model inference more efficient. To that end, the platform is extended using container technology, container orchestration technology, the Horovod distributed deep learning framework, and high-performance deep learning inference engines such as TensorRT, so that users can run distributed model training and high-performance model inference effectively. The improvement adds three separate functions: information query, code conversion, and model conversion. They give developers a convenient way to look up information, to convert original training code into distributed training code, and to convert original TensorFlow SavedModel models into models for high-performance inference engines such as TensorRT. The platform also supports distributed model training, model evaluation, and high-performance inference, so that users can carry out each of these steps on the platform.

I participated in parts of this improvement. My specific work was as follows: (1) Analyzing and organizing the general requirements to arrive at more specific requirements. (2) Researching the relevant technologies to help confirm the solutions for distributed training and high-performance inference. (3) Designing the architecture of the new functions. (4) Developing and maintaining the improved features. (5) Conducting brief tests of the completed functions and releasing them once the tests passed.

This work broadly addressed the following needs: (1) Users' need for quick lookups of TensorFlow's application programming interface and version changes. (2) Users' need to convert their original training code into code for distributed training. (3) Users' need to convert their original TensorFlow SavedModel models into TensorRT and other high-performance inference models. (4) Users' need to perform distributed training with TensorFlow. (5) Users' need to run high-performance inference with the converted models.

The system is now in production use. Its functions are simple and effective: they reduce the time users spend obtaining information, improve the efficiency of the corresponding work, and contribute to the company's benefits.
Keywords/Search Tags:Container Technology, Docker, Kubernetes