
Research On Distributed Inference Acceleration Technology For Convolutional Neural Networks

Posted on: 2023-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Shen
Full Text: PDF
GTID: 2558307097494914
Subject: Computer technology
Abstract/Summary:
In recent years, applications based on Convolutional Neural Networks (CNNs) have been widely deployed in many fields. CNNs are computation-intensive because they involve a large number of complex numerical calculations such as matrix multiplications and automatic differentiation. However, the computing capacity and resources of Internet of Things devices are limited: running CNNs directly on these devices imposes heavy computational pressure on them and makes it difficult to meet users' real-time requirements. Therefore, how to balance the limited computing resources of terminal devices against the huge resource demands of CNNs is one of the research hotspots in the field of edge intelligence. A popular solution is to divide a CNN by model partitioning and then complete inference collaboratively across multiple devices. However, increasingly complex network structures and dynamic environments bring many challenges to collaborative inference. To address these challenges, this paper studies the model partitioning algorithm and the collaborative inference framework in depth. The main research contents are summarized as follows:

To improve CNN inference speed on resource-limited devices, this paper proposes an on-demand fine-granularity partitioning method (OFPM). First, a partitioning method (FPM) for homogeneous environments is proposed, which combines horizontal and vertical partitioning. FPM divides a convolution layer into multiple convolution-layer partitions by horizontal partitioning, and determines the optimal execution device for each layer by vertical partitioning, so as to minimize the overall inference latency (a minimal code sketch of the horizontal split is given after this abstract). Second, considering the heterogeneity of devices and the dynamic nature of networks, an on-demand partitioning strategy (OPM) is designed for heterogeneous environments, and OPM is combined with FPM to form OFPM. OFPM automatically adjusts the height of the input feature-map slice of each convolution-layer partition according to device capabilities and network conditions, balancing the computation across devices as evenly as possible and further improving CNN inference speed (an illustrative height-allocation sketch also follows).

To coordinate devices in heterogeneous edge environments and obtain good end-to-end performance, this paper proposes a layer latency prediction model (FCPM) based on floating-point operations and CPU load, together with a distributed collaborative inference framework that supports fine-grained partitioning (DCFP). DCFP consists of three phases: offline training, online optimization, and collaborative inference. In the offline training phase, one set of FCPM models is trained for each device (an illustrative regression sketch follows this abstract). In the online optimization phase, model partitioning is performed according to the layer latencies predicted by FCPM and the network bandwidth, generating partition decisions adapted to the devices' computing capabilities and the network conditions. In the collaborative inference phase, all heterogeneous devices are coordinated according to the partition decisions to complete the inference jointly.

Finally, this paper uses common regression algorithms to verify the effectiveness of FCPM, and simulates homogeneous and heterogeneous environments to conduct extensive experiments on FPM and OFPM, respectively. Experimental results show that FCPM achieves a prediction accuracy of at least 88%, and that OFPM improves inference speed by 1 to 2.54 times compared with state-of-the-art partitioning methods.
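The abstract does not publish FPM's code, but the horizontal-partitioning idea it describes (splitting a convolution layer's output into bands of rows and giving each device only the input rows, plus halo, that its band depends on) can be illustrated with a minimal PyTorch sketch. The function name split_conv_by_height, the stride-1 setting, the square kernel, and the padding handling are assumptions made for illustration, not the thesis' actual algorithm:

```python
import torch
import torch.nn.functional as F

def split_conv_by_height(x, weight, bias, num_parts, padding=1):
    # Horizontal partitioning of one convolution layer: each part computes a
    # band of output rows from only the input rows (plus halo) it depends on.
    # Assumes stride 1 and bands taller than the padding; illustrative only.
    k = weight.shape[2]                          # kernel height
    H = x.shape[2]                               # input height
    out_h = H + 2 * padding - k + 1              # output height at stride 1
    bounds = [round(i * out_h / num_parts) for i in range(num_parts + 1)]
    bands = []
    for r0, r1 in zip(bounds[:-1], bounds[1:]):
        in0 = max(r0 - padding, 0)               # input rows needed for
        in1 = min(r1 + k - 1 - padding, H)       # output rows [r0, r1)
        slab = x[:, :, in0:in1, :]
        pad_top = padding if in0 == 0 else 0     # zero-pad only at real borders
        pad_bot = padding if in1 == H else 0
        slab = F.pad(slab, (padding, padding, pad_top, pad_bot))
        bands.append(F.conv2d(slab, weight, bias))
    return torch.cat(bands, dim=2)               # stitch the output bands
```

In a collaborative setting each slab would be sent to a different device; concatenating the resulting bands along the height dimension reproduces the output of running the full convolution on a single device.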
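The on-demand adjustment of slice heights in OPM is described only at a high level. Under the assumption that each device's share should be proportional to its effective speed, a toy allocation helper might look as follows (allocate_band_heights and the speed numbers are hypothetical; the thesis' OPM also accounts for network conditions, which this sketch ignores):

```python
def allocate_band_heights(out_h, device_speeds):
    # Split out_h output rows across devices in proportion to their estimated
    # speed (e.g. peak FLOPs/s discounted by current CPU load), so the devices
    # finish their bands at roughly the same time. Transmission cost omitted.
    total = sum(device_speeds)
    heights = [int(out_h * s / total) for s in device_speeds]
    heights[-1] += out_h - sum(heights)   # give the rounding remainder to the last device
    return heights

# e.g. three devices whose effective speeds differ by a factor of four
print(allocate_band_heights(224, [1.0, 2.0, 4.0]))   # -> [32, 64, 128]
```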
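The abstract states that FCPM predicts per-layer latency from floating-point operations and CPU load and is verified with common regression algorithms, without naming the regression form. The sketch below assumes a plain linear regression with scikit-learn; the FLOPs, load, and latency arrays are placeholder values for illustration, not measurements from the thesis:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative per-layer latency model in the spirit of FCPM: predict a layer's
# execution time on one device from its FLOP count and the device's CPU load.
flops    = np.array([1.2e8, 4.5e8, 9.0e8, 1.8e9, 3.6e9])   # layer FLOPs (placeholder)
cpu_load = np.array([0.10, 0.35, 0.20, 0.60, 0.45])        # fractional CPU load (placeholder)
latency  = np.array([2.1, 8.4, 14.9, 52.0, 78.3])          # measured ms (placeholder)

X = np.column_stack([flops, cpu_load])
model = LinearRegression().fit(X, latency)   # one such model per device, per the thesis

# Online phase: predict the latency of a 2 GFLOP layer on a 50%-loaded device.
print(model.predict(np.array([[2.0e9, 0.5]])))
```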
Keywords/Search Tags:Edge computing, Deep neural networks, Edge intelligence, Collaborative inference, Model partitioning