
Research on Hardware-Software Co-design Methods for Deep Neural Network Accelerators

Posted on: 2022-09-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K Xu
Full Text: PDF
GTID: 1488306560993359
Subject: Signal and Information Processing
Abstract/Summary:
Deep neural networks (DNNs) have achieved remarkable results in computer vision, natural language processing, speech recognition, and other fields. However, their high computational and storage costs pose great challenges for deploying DNN-based algorithms, especially in embedded settings with limited hardware resources. In recent years, neural network compression has gradually become a hot research topic in academia and industry. However, some compression algorithms are designed without regard to the actual accelerator scenario, resulting in a large gap between the theoretical compression performance of the algorithm and the actual hardware speedup. To overcome this hurdle, this dissertation combines model pruning and quantization with FPGA (Field-Programmable Gate Array) based hardware architecture design to achieve a high-throughput, low-latency DNN accelerator. A hardware-software co-design approach is adopted to conduct in-depth research at four levels: hardware-constrained compression algorithms, algorithm-hardware coupled optimization, hardware design adapted to compression algorithms, and system-level design of an object detection accelerator.

(1) For the hardware-constrained compression algorithm, an improved genetic algorithm is adopted as the search framework to explore the per-layer pruning rate and bit width of a model under hardware constraints. In the pruning stage, a multi-objective optimization strategy based on model size and workload is proposed, which greatly alleviates the problem of uneven pruning. Experiments show that the proposed pruning scheme reduces the computational workload of ResNet50 on the ImageNet dataset by up to 80%. In the quantization stage, a few-shot quantization learning strategy is proposed to address the poor correlation between evaluation and fine-tuning results. Experimental results on the CIFAR-10 and ImageNet datasets demonstrate that the proposed mixed-precision method outperforms handcrafted uniform bit-width counterparts and other mixed-precision techniques.

(2) For algorithm-hardware coupled optimization, by combining the sparsity produced by pruning with the data independence of quantization, this dissertation proposes the novel ABM-SpConv (Accumulate-Before-Multiply Sparse Convolution) computation method. The traditional multiply-accumulate (MAC) coupled convolution is decomposed, by combining like terms, into a two-stage operation that decouples accumulation from multiplication; zero-valued operations are then skipped according to the sparse encoding of the model weights, which theoretically improves the computational efficiency and parallelism of sparse convolution.

(3) For the hardware design adapted to compression algorithms, building on the ABM-SpConv computation method, this dissertation proposes and designs a heterogeneous sparse convolution computation unit consisting of a large accumulator array and a small multiplier array. It can perform the accumulation and multiplication stages of convolution independently, balancing the utilization of FPGA on-chip logic and DSP resources. Second, an asynchronous convolution design is adopted: each computing unit has its own local buffer and control logic and can independently execute convolution tasks with different workloads, alleviating the load imbalance caused by sparse models. Finally, the accelerator is fully parameterized and uses a self-developed automated design-space-exploration engine to support deployment on platforms ranging from embedded devices to high-performance FPGA boards.

(4) For the system-level design, this dissertation implements a YOLOv2-based real-time FPGA object detection accelerator system. Combining compression methods such as operator fusion, pruning, and quantization, the YOLOv2 parameters are compressed by 20x and the computation by 7x. The compressed model maintains 74.45 mAP (mean Average Precision) on the PASCAL VOC 2007 dataset. A deeply pipelined sparse hardware accelerator architecture, including max pooling, is then designed. Finally, through parameter space exploration, the YOLOv2 model is deployed on an Intel Arria-10 GX1150 FPGA board, achieving a real-time detection speed of 72 Frames Per Second (FPS).
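The hardware-constrained search described in (1) can be illustrated with a minimal sketch of a genetic algorithm choosing per-layer pruning rates under a workload budget. This is not the dissertation's implementation: the layer statistics, budget, penalty weighting, and GA hyperparameters below are all hypothetical, and the accuracy feedback a real search would include is omitted.

```python
import random

# Hypothetical per-layer stats: (num_params, num_macs) for a toy 4-layer net.
LAYERS = [(1000, 50000), (2000, 80000), (4000, 120000), (500, 10000)]
BUDGET = 0.5  # hardware constraint: keep at most 50% of the original MACs

def cost(rates):
    """Multi-objective fitness: remaining model size plus remaining workload,
    with a large penalty when the hardware budget is violated."""
    size = sum(p * (1 - r) for (p, _), r in zip(LAYERS, rates))
    macs = sum(m * (1 - r) for (_, m), r in zip(LAYERS, rates))
    total = sum(m for _, m in LAYERS)
    penalty = max(0.0, macs / total - BUDGET) * 1e6
    return size + macs + penalty

def evolve(pop_size=20, gens=30):
    """Evolve per-layer pruning rates in [0, 0.9] toward low cost."""
    pop = [[random.uniform(0, 0.9) for _ in LAYERS] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]          # elitist selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(LAYERS))   # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.3:                # Gaussian mutation
                i = random.randrange(len(LAYERS))
                child[i] = min(0.9, max(0.0, child[i] + random.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return min(pop, key=cost)
```

The penalty term steers the search toward configurations that satisfy the MAC budget while still trading off model size against workload, which mirrors the multi-objective strategy described above in spirit only.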
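The ABM-SpConv idea in (2) — combining like terms so that activations sharing the same quantized weight value are accumulated first and multiplied only once, with zero weights skipped — can be sketched as a scalar dot product. This is an illustrative software version under those assumptions, not the dissertation's FPGA design.

```python
from collections import defaultdict

def abm_spconv_dot(weights, activations):
    """Accumulate-Before-Multiply sparse dot product (sketch).

    Since quantized weights take few distinct values, sum_i w_i * x_i can be
    rewritten as sum over distinct values v of v * (sum of x_i where w_i == v):
    accumulation happens before multiplication, and pruned (zero) weights
    contribute nothing and are skipped entirely."""
    sums = defaultdict(int)
    for w, x in zip(weights, activations):
        if w != 0:                 # sparsity: skip zero-valued weights
            sums[w] += x           # stage 1: accumulate per weight value
    return sum(w * s for w, s in sums.items())  # stage 2: one multiply per value
```

For a filter with four nonzero weights drawn from three distinct values, the naive MAC loop needs four multiplications, while this formulation needs three; at low bit widths the number of distinct values is small, which is what motivates the large-accumulator / small-multiplier array split described in (3).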
Keywords/Search Tags:Deep Neural Network, Hardware and Software Co-Design, Pruning, Quantization, FPGA, Accelerator