
A Reconfigurable Accelerator For Deep Learning Training Based On FPGA

Posted on: 2022-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: T T Yin
Full Text: PDF
GTID: 2518306725490764
Subject: Communication and Information System
Abstract/Summary:
In recent years, deep learning has made great breakthroughs in many fields, such as object detection, image classification, speech recognition, autonomous driving, and super-resolution reconstruction. Although increasingly powerful models can achieve desirable performance, this comes with a rapid increase in computational complexity and a large demand for storage volume. At present, the training and inference of deep neural networks (DNNs) can be deployed on GPU clusters with ultra-high computing power, but deploying models on edge devices still faces serious challenges. First, the limits of Moore's law mean that progress in process technology can no longer be counted on for unbounded improvements in computing power. Second, the huge memory requirements cannot be borne by edge devices. Therefore, to implement DNNs on resource-limited edge platforms without sacrificing processing speed, it is necessary to design deep learning hardware accelerators with high energy efficiency, high throughput, low power consumption, and low latency.

For the inference phase, many hardware accelerators already exist: models are trained on high-computing-power platforms, and the trained models are then deployed on edge devices for the corresponding inference tasks. However, there are only a few hardware accelerators for the training phase. The large amount of data, the frequent transfers between on-chip and off-chip memories, and the high memory requirements of training undoubtedly increase the difficulty of hardware deployment. In addition, training involves both forward and backward computation; directly reusing an inference accelerator not only fails to achieve satisfactory results but may even be counterproductive. Therefore, the best way to solve this problem is to design a dedicated hardware accelerator for deep learning training. In this context, this paper focuses on the above problems and proposes an accelerator design for DNN training. Firstly, this
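To make the extra cost of the training phase concrete, the following NumPy sketch contrasts inference with training for a single fully connected layer. It is purely illustrative (none of the names come from the thesis): inference needs one matrix multiply, while training adds an input-gradient and a weight-gradient multiply and must keep the activations in memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected layer. Inference needs one matmul; training needs
# three (forward, input gradient, weight gradient) plus stored activations --
# the extra compute and memory traffic the text refers to.
batch, d_in, d_out = 4, 8, 3
x = rng.standard_normal((batch, d_in))   # layer input (must be saved for training)
W = rng.standard_normal((d_in, d_out))   # weights

# Inference: forward pass only.
y = x @ W                    # 1 matmul

# Training: backward pass from an upstream gradient dy.
dy = rng.standard_normal((batch, d_out))
dx = dy @ W.T                # backward (input-gradient) phase
dW = x.T @ dy                # weight-gradient phase; needs the saved x
```

For this layer, training performs roughly three times the multiply-accumulate work of inference, which is why an accelerator tuned only for inference falls short.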
paper surveys current neural network accelerators: most are designed for the inference stage, and there is little acceleration work for the training stage. This paper therefore presents a reconfigurable deep neural network training accelerator based on FPGA. A reconfigurable processing unit is designed that flexibly supports various computing modes within a unified architecture. In addition, an optimized architecture is proposed to realize the computation of the batch normalization layer in its different stages. Using the proposed framework, the popular ResNet-20 model on the CIFAR-10 dataset is implemented on the Xilinx VC706 platform. The experimental results show that the accelerator significantly outperforms other works.

As one of the most representative deep learning networks, the generative adversarial network (GAN) has been widely used in image generation, style transfer, video generation, and other fields in recent years. However, GAN training is more complex than that of traditional deep neural networks because of its high computational complexity, the large amount of intermediate data to be stored, and the iterative updates of the generator and discriminator networks, so training GANs on embedded platforms is a very challenging problem. We therefore propose an FPGA-based reconfigurable accelerator for efficient GAN training. Firstly, convolution can be regarded as a large number of multiply-accumulate operations, and the principle of the fast FIR algorithm is to add a modest number of adders in order to reduce the number of multipliers, which coincides with the goal of minimizing multipliers in hardware design. Therefore, in this paper, the cascaded fast FIR algorithm (CFFA) is optimized towards GAN training, and a fast convolution processing element (FCPE) based on the optimized algorithm is introduced to support the various computation patterns that arise during GAN
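The adder-for-multiplier trade behind the fast FIR algorithm can be sketched numerically. The classic 2-parallel FFA splits the filter and input into even/odd phases and computes three half-length sub-filters instead of four, cutting multiplications by 25% at the cost of a few extra additions; the cascaded form (CFFA) the thesis optimizes applies this decomposition recursively. The function names below are illustrative, not from the thesis.

```python
import numpy as np

def fir_direct(x, h):
    # Standard FIR: y[n] = sum_k h[k] * x[n-k], truncated to len(x) outputs.
    return np.convolve(x, h)[:len(x)]

def fir_ffa2(x, h):
    """2-parallel fast FIR algorithm (FFA).

    Three half-length sub-filters replace four, trading multipliers
    for adders -- the same trade CFFA exploits in hardware.
    Requires even-length x and h for a clean polyphase split.
    """
    assert len(x) % 2 == 0 and len(h) % 2 == 0
    x0, x1 = x[0::2], x[1::2]            # even / odd input phases
    h0, h1 = h[0::2], h[1::2]            # even / odd tap phases
    a = np.convolve(h0, x0)              # sub-filter 1: H0·X0
    b = np.convolve(h1, x1)              # sub-filter 2: H1·X1
    c = np.convolve(h0 + h1, x0 + x1)    # sub-filter 3: (H0+H1)(X0+X1)
    y1 = c - a - b                       # odd output phase: H0X1 + H1X0
    y0 = a.copy()                        # even output phase: H0X0 + z^-2 H1X1
    y0[1:] += b[:-1]                     # the z^-2 delay is one sample per phase
    n = len(x) // 2
    y = np.empty(len(x))
    y[0::2], y[1::2] = y0[:n], y1[:n]    # re-interleave the two phases
    return y
```

Both paths compute the same output; the FFA path simply reaches it with fewer multiplications, which maps directly onto DSP-slice savings on an FPGA.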
training. Secondly, a well-optimized architecture based on FCPEs is presented, which flexibly supports the forward, backward, and weight-gradient phases of GAN training. Finally, the training of a prevailing network (DCGAN) is implemented on the Xilinx VCU108 platform with our methods. Experimental results show that our design achieves 315.18 GOPS and 83.87 GOPS/W in terms of throughput and energy efficiency, respectively, and the comparison results prove that our accelerator significantly outperforms previous works.

To sum up, this paper is committed to the design of hardware accelerators for efficient deep learning training. By combining algorithm optimization with hardware architecture design, we propose corresponding FPGA-based acceleration schemes. Compared with other research in this direction, our design shows significant improvement in overall performance.
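The three training phases that a reconfigurable architecture must cover can be written out for a 1-D convolution layer. This is a generic textbook decomposition, not the thesis's dataflow: all three phases reduce to convolution-style multiply-accumulate patterns, which is what makes a single reconfigurable processing-element array able to serve them all.

```python
import numpy as np

rng = np.random.default_rng(0)

# One 1-D convolution layer, three training phases (illustrative sketch).
N, K = 10, 3
x = rng.standard_normal(N)       # layer input (saved for the weight-gradient phase)
w = rng.standard_normal(K)       # kernel weights

# Forward phase: cross-correlation, as deep learning frameworks define "conv".
y = np.correlate(x, w, mode='valid')      # length N-K+1

# Upstream gradient arriving from the next layer during backpropagation.
dy = rng.standard_normal(N - K + 1)

# Backward phase: input gradient is a full convolution of dy with the kernel
# (equivalently, correlation with the flipped kernel).
dx = np.convolve(dy, w, mode='full')      # length N

# Weight-gradient phase: correlate the saved input with dy.
dw = np.correlate(x, dy, mode='valid')    # length K
```

All three lines are the same multiply-accumulate kernel with different operands and boundary handling, so a unified FCPE-style datapath only needs reconfigurable operand routing, not three separate engines.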
Keywords/Search Tags:Deep learning, generative adversarial networks, hardware accelerator, training accelerator, reconfigurable design, FPGA