With the advent of the era of artificial intelligence, more and more researchers have turned to intelligent products in order to make human life more convenient. Artificial neural networks are one of the important ways to achieve intelligence, and among their many algorithms, learning algorithms based on the Convolutional Neural Network (CNN) have shown great advantages over traditional algorithms in applications such as image classification, speech recognition, and natural language processing. At present, convolutional neural networks are mostly implemented on general-purpose computers, which are not only bulky but also offer poor real-time performance, so CNNs face severe limitations in mobile devices and edge computing. A smaller and faster implementation is therefore urgently needed, and the hardware implementation of convolutional neural networks has gradually become a research hotspot.

Hardware implementation platforms fall roughly into the Graphics Processing Unit (GPU), the Application-Specific Integrated Circuit (ASIC), and the Field-Programmable Gate Array (FPGA). For mobile intelligent information processing platforms and CNN hardware in the field of edge computing, the complex and variable application scenarios usually demand low power consumption, fast processing speed, and a short development cycle. The GPU has high power consumption, while the ASIC has a long development cycle and poor flexibility, so neither meets the requirements of edge computing well. Because the FPGA not only supports parallel computing but also offers rich programmable logic resources, strong flexibility, and a short development cycle, it can well meet the hardware implementation requirements of convolutional neural networks in the field of edge
computing.

A convolutional neural network is mainly composed of convolutional layers, sub-sampling layers, and fully connected layers. Current research on FPGA implementations of CNNs concentrates on accelerating the convolutional and sub-sampling layers, with much less work on the FPGA implementation of the fully connected layers. Unlike the convolutional layer, with its local perception and weight sharing, every neuron in a fully connected layer is connected to all inputs of the previous layer, so the computation of the fully connected layer is still huge and also urgently needs hardware acceleration. This paper therefore focuses on FPGA-based hardware implementation technology for the fully connected layer of a convolutional neural network.

In this paper, the principle and structure of the CNN fully connected layer are analyzed in detail. A top-down design approach divides the whole system into several modules with relatively independent functions and structures; each module is described in Verilog HDL, and the modules can be combined as needed to build the required fully connected layer hardware structure. Among them, the configurable floating-point multiply-accumulator is one of the core modules of the whole system and a key technology of the FPGA implementation: it carries out the general and time-consuming operation of the fully connected layer, matrix multiplication. In practical applications, the floating-point multiply-accumulator can be configured by parameters to different degrees of parallelism according to the required logic resource occupancy and operation speed. This configurable design greatly enhances the flexibility of the fully connected layer in terms of area occupation and
operation speed, and achieves a balance between the two.

For the activation function, this paper evaluates candidate functions and implementation methods. Considering operation accuracy, operation speed, and logic resource occupancy, the Sigmoid function is selected and approximated by a piecewise linear function, which offers fast operation and low logic resource occupancy while keeping the error small. To change the network topology and the connections between neurons conveniently and flexibly, a storage structure describing the topology and scale of the fully connected layer is constructed; it occupies little storage space while remaining highly flexible. The Verilog HDL code for the FPGA implementation of the fully connected layer is then generated by a C++ program. For data storage during the fully connected layer computation, reading and writing the Block RAM on the FPGA with appropriate parallelism effectively balances the operation bandwidth and the storage bandwidth.

To verify the validity of the design, a testbench was written and simulated with ModelSim SE; the simulation results are consistent with expectations. To verify the timing characteristics, the TimeQuest Timing Analyzer built into Quartus II was used for timing constraints and static timing analysis, which showed that the highest operating frequency of the system can reach 106.11 MHz. Finally, the design was tested for functionality, performance, and logic resource occupancy on Altera's Cyclone IV E EP4CE115F29C7. The test results show that the design not only offers arbitrary parallelism, high precision, and a flexible connection structure, but also achieves a balance among operation speed, operation precision, and logic resource occupation, and can
meet the application requirements of edge computing.
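The configurable floating-point multiply-accumulator described above essentially performs a matrix-vector product using several parallel accumulation lanes. The following is a rough behavioural sketch of that scheme, not the thesis's Verilog design; the parameter `P` and the interleaved slicing of the dot product are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Behavioural model of a fully connected layer computed by P parallel
// multiply-accumulate lanes. Each lane accumulates an interleaved slice
// of the dot product; the partial sums are then reduced, mimicking the
// final adder stage of a hardware MAC. P models the parallelism
// parameter of the configurable floating-point multiply-accumulator.
std::vector<float> fc_forward(const std::vector<std::vector<float>>& W,
                              const std::vector<float>& x,
                              std::size_t P) {
    std::vector<float> y(W.size(), 0.0f);
    for (std::size_t n = 0; n < W.size(); ++n) {
        std::vector<float> lane(P, 0.0f);      // one accumulator per MAC lane
        for (std::size_t i = 0; i < x.size(); ++i)
            lane[i % P] += W[n][i] * x[i];     // lanes run concurrently in hardware
        for (float p : lane)                   // final reduction of partial sums
            y[n] += p;
    }
    return y;
}
```

Raising `P` trades logic resources for speed, which is the area/speed balance the abstract refers to.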
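For the piecewise linear Sigmoid approximation, the abstract does not give the segment breakpoints; a widely used scheme of this kind is the PLAN-style approximation sketched below (maximum error around 0.019), shown as an illustration of the technique rather than the thesis's exact segments.

```cpp
#include <cmath>

// Piecewise linear approximation of sigmoid(x) = 1/(1+exp(-x)).
// The positive half-axis is covered by four linear segments, and
// negative inputs use the symmetry sigmoid(-x) = 1 - sigmoid(x).
// The slope and offset constants are sums of powers of two, so in
// hardware the multiplications reduce to shifts and adds, which is
// why this style of approximation occupies few logic resources.
float sigmoid_pl(float x) {
    float ax = std::fabs(x);
    float y;
    if (ax >= 5.0f)        y = 1.0f;
    else if (ax >= 2.375f) y = 0.03125f * ax + 0.84375f;
    else if (ax >= 1.0f)   y = 0.125f   * ax + 0.625f;
    else                   y = 0.25f    * ax + 0.5f;
    return x >= 0.0f ? y : 1.0f - y;
}
```

Because no exponential or division is evaluated, such an approximation meets the speed, accuracy, and resource trade-off the abstract describes.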
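The abstract notes that the Verilog HDL for the fully connected layer is generated by a C++ program from a stored description of the topology. The fragment below is a minimal hypothetical sketch of such a generation step; the layer-size descriptor and the parameter names (`N_IN_*`, `N_OUT_*`) are assumptions for illustration, not the thesis's actual storage format or naming.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical generator: a layer-size list such as {784, 128, 10}
// describes the topology and scale of the fully connected layers, and
// one Verilog parameter line is emitted per layer. Changing the list
// changes the generated hardware structure without hand-editing HDL.
std::string emit_layer_params(const std::vector<int>& sizes) {
    std::ostringstream out;
    for (std::size_t i = 0; i + 1 < sizes.size(); ++i)
        out << "parameter N_IN_"  << i << " = " << sizes[i]
            << ", N_OUT_" << i << " = " << sizes[i + 1] << ";\n";
    return out.str();
}
```

A compact descriptor of this kind keeps the storage footprint small while letting the network scale and connectivity be reconfigured flexibly, as the abstract claims.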