
Research On The Key Technologies Of DSP For Deep Learning Algorithm

Posted on: 2020-09-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Yang
Full Text: PDF
GTID: 1488306548492614
Subject: Electronic Science and Technology
Abstract/Summary:
With the development of big data and of hardware computing resources, artificial intelligence has entered the age of deep learning. This age has two defining characteristics: a very large amount of computation, and results far beyond traditional methods in many application fields. Although traditional artificial-intelligence methods require relatively little computation, their performance has reached a bottleneck and is difficult to improve further. At present, given enough computation and enough training data, a reasonably designed deep neural network usually has higher performance potential than traditional methods. To obtain sufficient computing acceleration for deep learning, various hardware efforts have been launched, such as optimizing deep learning on GPUs, designing dedicated ASICs that integrate far more computing resources, and building more flexible dedicated accelerators on FPGAs. X-DSP is a programmable high-performance DSP for scientific computing; most programs developed on X-DSP are GEMM, FFT and the like. To further expand the application field of the domestic X-DSP and to optimize the architecture of the next-generation domestic DSP, this paper systematically studies methods for realizing deep learning on X-DSP, based on the characteristics of deep-learning algorithms and of the X-DSP architecture. Building on the X-DSP architecture, this paper studies the possibility of realizing deep-learning algorithms from vector computing units up to array computing units without complex data flow, and studies the computing-unit architecture of the domestic DSP in advance. Throughout the research, this paper adheres to the principles of generality, efficiency, and the close combination of algorithm and hardware architecture, making full use of the computing, transmission and storage resources of the hardware as much as
possible. The main work and innovations of this paper are as follows:

1. Based on the architecture of the domestic X-DSP processor, this paper lowers the convolution layer, the most important layer in deep-learning algorithms, into multiple groups of vector multiplications. This differs from the method used on general programmable GPUs, which lowers the convolution layer into a single matrix multiplication. The whole calculation fully utilizes both the vector units and the scalar units of the domestic X-DSP. Experimental results show that the utilization rate of computing resources for multiple convolution layers on the multi-core X-DSP is about 65%.

2. This paper also studies the other layers of deep-learning algorithms and proposes a suitable mapping method for each. The relatively complex normalization layer is transformed into vector form through data-dimension conversion and matrix conversion, so that it can be processed directly by the vector units of the domestic X-DSP. The fully-connected layer is in fact a GEMM; by partitioning the large-scale matrices, this paper maps large-scale GEMM onto the domestic X-DSP. The mappings of the remaining layers are studied as well. Experimental results show that the utilization rate of computing resources on the multi-core X-DSP is about 17% for the fully-connected layer and about 1.3% for the normalization layer, but both reach about 70% of the ideal utilization rate. This paper also analyzes the bottlenecks of these layers on the multi-core X-DSP, accumulating experience for the next generation of domestic DSPs.

3. This paper proposes an empirical segmentation formula for mapping the convolution layer onto the domestic X-DSP. Since the total number of multiplications and additions in a convolution layer is fixed, the
empirical formula focuses on analyzing data transmission and constraint conditions, and provides the segmentation basis for convolution layers with single or multiple input feature maps on the domestic X-DSP. A convolution layer divided according to the empirical formula and implemented on the multi-core X-DSP incurs the least data transmission between the chip and external storage.

4. Based on the X-DSP architecture, this paper combines the vector computing units into array computing units and studies the mapping of deep-learning layers onto these array units. This architectural study lays the foundation for future research on domestic DSPs. To further improve processing speed, this paper develops a Winograd acceleration algorithm with tile size 6×6 without loss of calculation accuracy, and proposes a method of multiple operations per single broadcast to reduce data accesses during the calculation. Simulation results show that, with ideal bandwidth, the resource utilization rate of the array computing units is about 90% for both the convolution layer and the fully-connected layer.
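The lowering in point 1 can be illustrated in miniature. The abstract does not give the exact data layout used on X-DSP, so the sketch below only shows the general idea it contrasts with GPU-style im2col: each output pixel is produced by a group of vector dot products (one per filter) rather than by one large matrix multiplication.

```python
import numpy as np

def conv_as_vector_mults(x, w):
    """Valid cross-correlation of x: (C, H, W) with w: (K, C, R, S),
    computed as groups of vector dot products per output pixel."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, W - S + 1
    y = np.zeros((K, Ho, Wo))
    wv = w.reshape(K, -1)  # each filter flattened into one vector of length C*R*S
    for i in range(Ho):
        for j in range(Wo):
            patch = x[:, i:i + R, j:j + S].reshape(-1)  # matching input vector
            y[:, i, j] = wv @ patch  # K vector dot products for this pixel
    return y
```

On a vector DSP, each of these dot products maps naturally onto the vector lanes, while the loop control runs on the scalar unit.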
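Point 2 treats the fully-connected layer as a GEMM whose large matrices are partitioned to fit on chip. The abstract does not state the actual partitioning scheme; the following blocked multiply is a minimal sketch of the idea, with an illustrative tile size.

```python
import numpy as np

def tiled_gemm(A, B, tile=4):
    """Blocked GEMM: each tile x tile sub-block of A and B is small enough to
    stage in on-chip memory before its partial products are accumulated.
    The tile size is illustrative, not the dissertation's actual choice."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for k in range(0, K, tile):
            for j in range(0, N, tile):
                # numpy slicing clips ragged edge tiles automatically
                C[i:i + tile, j:j + tile] += A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
    return C
```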
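The empirical formula of point 3 is not reproduced in this abstract. To convey its flavor, the sketch below enumerates output-tile sizes under an on-chip buffer constraint and picks the one minimizing off-chip traffic; the traffic model (each tile reloads its input halo and writes its outputs, weights loaded once) is an assumption for illustration, not the dissertation's formula.

```python
import math

def pick_conv_tiling(H, W, C, K, R, S, buf_words):
    """Return (estimated words moved, tile height, tile width) minimizing
    off-chip traffic for a valid convolution, or None if no tile fits."""
    Ho, Wo = H - R + 1, W - S + 1
    weights = K * C * R * S          # filter weights, loaded once (assumed)
    best = None
    for th in range(1, Ho + 1):
        for tw in range(1, Wo + 1):
            in_tile = C * (th + R - 1) * (tw + S - 1)  # input halo per tile
            out_tile = K * th * tw
            if in_tile + out_tile + weights > buf_words:
                continue             # tile does not fit on chip
            ntiles = math.ceil(Ho / th) * math.ceil(Wo / tw)
            traffic = ntiles * (in_tile + out_tile) + weights
            if best is None or traffic < best[0]:
                best = (traffic, th, tw)
    return best
```

A larger on-chip buffer never increases the minimum achievable traffic, since every tiling feasible for the small buffer remains feasible for the large one.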
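The Winograd algorithm of point 4 uses 6×6 tiles; its transform matrices are not given in this abstract. As a minimal illustration of the transform structure, here is the classic 1-D F(2,3) variant, which produces 2 outputs of a 3-tap filter from 4 inputs using 4 multiplications instead of 6.

```python
import numpy as np

# Standard F(2,3) transform matrices (input, filter, and output transforms).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs of valid correlation."""
    return AT @ ((G @ g) * (BT @ d))  # elementwise product in transform domain
```

The 2-D 6×6-tile version used in the dissertation nests such transforms along both spatial axes; the elementwise product in the transform domain is what maps onto the array computing units.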
Keywords/Search Tags: Deep learning, Domestic X-DSP, Convolutional layer, Mapping, Programming optimization, Array computing units