
Research And Implementation Of High-speed Object Detection Network Based On FPGA Accelerator

Posted on: 2022-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: S C Ma
Full Text: PDF
GTID: 2518306605465134
Subject: Master of Engineering
Abstract/Summary:
Recently, research on object detection neural networks has made vast progress. These networks are applied in smart cities and autonomous driving, and help raise productivity. However, object detection networks rely on massive amounts of data, which places high throughput demands on the computing platform. This thesis studies how to compute object detection neural networks efficiently.

Computing platforms can be divided into general-purpose platforms and customized platforms. General-purpose platforms such as CPUs and GPUs adopt a control-flow architecture with unified management of the computing and storage modules. Data reads and writes must pass through slower external storage, producing the "memory wall" bottleneck. General-purpose platforms mostly accelerate computing by raising the clock frequency, and they cannot provide circuits dedicated to a particular network, which leads to a low energy-efficiency ratio and rules them out in some scenarios. This work therefore chooses a customized platform: it builds a dataflow architecture on an FPGA accelerator, where a dedicated circuit achieves high-performance inference while balancing computing power and energy efficiency. In summary, the contributions of this work are as follows.

Firstly, this work adopts a dataflow architecture to reduce external memory accesses. Each execution unit in the accelerator has its own independent control mode, so operation units transfer data flexibly among one another and on-chip storage, avoiding frequent external memory loads during computing. Consequently, on-chip data are reused and communication cost is reduced.

Secondly, this work deploys a multi-level storage system to minimize the data transmission path. It uses memories with different access costs, e.g. off-chip DDR, on-chip Block RAM, and distributed CLB storage, to
build a multi-level storage system combined with the network's dataflow. Data are rearranged in this system, transferred in a pipelined fashion, and finally cached in distributed CLB storage, so that computation takes place close to the data.

Thirdly, this work deploys a systolic array to accelerate convolution. The array consists of PE elements with different modes and positions. During computation, all PEs in the array work simultaneously and data are transferred systolically, so the computation is parallelized. The design of the systolic array matches the characteristics of convolution: it exploits parallelism both across channels and across spatial positions, and it raises the accelerator's throughput.

Fourthly, this work designs a dedicated instruction set to remain compatible with different networks. The instruction set targets object detection networks and uses a dedicated scheme to control the computing and storage modules. In a coarse-grained control mode, four instruction types are defined (transmission, operation, control, and shift) and used to map the network onto the circuit modules. The accelerator is therefore highly compatible and can be retargeted to different object detection networks quickly through the instruction set.

Finally, this work presents INT8 and INT16 quantization schemes for different scenarios. Following the idea of dynamic fixed-point numbers, this work maps high-precision data to low-precision representations. The INT16 scheme offers higher accuracy, while the INT8 scheme uses a lower-bit network and doubles the accelerator's peak throughput at a small cost in detection accuracy. The two schemes share the same architecture and similar ideas, and they use the data bandwidth efficiently for different needs, demonstrating the simplicity and flexibility of the design.

This work runs the YOLOv3-Tiny network on three-channel input images of size 256×256 on a Xilinx
Virtex-7 690T chip. At INT16 data width, the accelerator's peak throughput is 154 GOPS, the power is 5.16 W, and the energy-efficiency ratio is 29.8 GOPS/W. At INT8 data width, the peak throughput is 308 GOPS, the power is 7.95 W, and the energy-efficiency ratio is 38.7 GOPS/W. The design is implemented in Verilog HDL, and its binary results match the software reference bit for bit. With its controllable design and reliable computation, this work has meaningful application value.
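The systolic-array computation described in the third contribution can be sketched in software. The following Python model is illustrative only, not the thesis's Verilog RTL: it simulates an output-stationary M×N array computing a matrix product, which is how convolution is commonly mapped onto such arrays after an im2col rearrangement. The skewed feeding scheme and all names are assumptions for illustration.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level model of an output-stationary systolic array.

    A is M x K, B is K x N. Each PE (i, j) accumulates A[i, :] . B[:, j].
    A operands flow rightward along rows, B operands flow downward along
    columns, one hop per cycle; row i of A and column j of B enter the
    edges skewed by i and j cycles so matching operands meet in PE (i, j).
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N), dtype=A.dtype)    # one local accumulator per PE
    a_reg = np.zeros((M, N), dtype=A.dtype)  # horizontal pipeline registers
    b_reg = np.zeros((M, N), dtype=A.dtype)  # vertical pipeline registers
    for t in range(M + N + K - 2):           # cycles until the array drains
        # sweep from bottom-right so each register still holds last
        # cycle's value when its downstream neighbour reads it
        for i in reversed(range(M)):
            for j in reversed(range(N)):
                a_in = a_reg[i, j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < K else 0)
                b_in = b_reg[i - 1, j] if i > 0 else (B[t - j, j] if 0 <= t - j < K else 0)
                acc[i, j] += a_in * b_in     # multiply-accumulate in place
                a_reg[i, j] = a_in           # forward operands to neighbours
                b_reg[i, j] = b_in
    return acc
```

Each PE multiplies the operands arriving in a given cycle and accumulates locally; operands hop one PE per cycle, so all M×N PEs work in parallel once the pipeline fills, mirroring the parallelism across channels and positions described above.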
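The four coarse-grained instruction classes (transmission, operation, control, shift) suggest a simple fixed-width encoding. The sketch below is hypothetical; the abstract does not give the real field widths, opcode values, or operand meanings, so every field here is invented purely to show how such an instruction word could be packed and unpacked.

```python
# Hypothetical opcode values for the four instruction classes named
# in the abstract; the real hardware encoding is not specified there.
OPCODES = {"TRANSMIT": 0, "OPERATE": 1, "CONTROL": 2, "SHIFT": 3}

def encode(op, dst, src, length):
    """Pack one 32-bit instruction word as
    [2-bit opcode | 10-bit dst | 10-bit src | 10-bit length]."""
    assert dst < 1024 and src < 1024 and length < 1024
    return (OPCODES[op] << 30) | (dst << 20) | (src << 10) | length

def decode(word):
    """Unpack an instruction word back into (opcode name, dst, src, length)."""
    names = {v: k for k, v in OPCODES.items()}
    return (names[word >> 30], (word >> 20) & 0x3FF,
            (word >> 10) & 0x3FF, word & 0x3FF)
```

A coarse-grained word like this lets one instruction drive a whole burst transfer or a full systolic-array launch, which is what allows a small program to remap the accelerator to a different object detection network.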
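The dynamic fixed-point idea behind the INT8/INT16 schemes can be illustrated as follows: all values in a tensor share one power-of-two scale chosen from the tensor's largest magnitude, so multiplications stay integer and rescaling is just a shift. This is a minimal sketch of the general technique, assuming a per-tensor scale; the thesis's exact scheme may differ.

```python
import numpy as np

def quantize_dynamic_fixed_point(x, bits):
    """Map a float tensor to signed `bits`-wide integers with one shared
    power-of-two scale (2**frac), picked so the largest magnitude fits."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    # largest fractional length such that max_abs * 2**frac <= qmax
    frac = int(np.floor(np.log2(qmax / max_abs))) if max_abs > 0 else 0
    scale = 2.0 ** frac
    q = np.clip(np.round(x * scale), -qmax - 1, qmax).astype(np.int32)
    return q, frac

def dequantize(q, frac):
    """Recover approximate floats by undoing the power-of-two scale."""
    return q.astype(np.float64) / (2.0 ** frac)
```

Halving the operand width lets the same DSP and bandwidth budget process twice as many operands per cycle, which is consistent with the reported peak throughput of 154 GOPS at INT16 versus 308 GOPS at INT8.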
Keywords/Search Tags: Object Detection Networks, FPGA Accelerator, Instruction Set, Systolic Array, Multi-level Storage