With the advent of the era of artificial intelligence, more and more researchers have turned to intelligent products in order to make human life more convenient. Artificial neural networks are one of the important ways to achieve intelligence, and among their many algorithms, learning algorithms based on the Convolutional Neural Network (CNN) have shown great advantages over traditional algorithms in applications such as image classification, speech recognition, and natural language processing. At present, convolutional neural networks are mostly implemented on general-purpose computers, which are not only bulky but also offer poor real-time performance, so CNNs face severe limitations in mobile devices and edge computing. A smaller and faster implementation is therefore urgently needed, and the hardware implementation of convolutional neural networks has gradually become a research hotspot.

Hardware implementation platforms fall roughly into the Graphics Processing Unit (GPU), the Application-Specific Integrated Circuit (ASIC), and the Field-Programmable Gate Array (FPGA). For mobile intelligent information processing platforms and CNN hardware in the field of edge computing, the complex and variable application scenarios usually demand low power consumption, fast processing speed, and a short development cycle. The GPU has high power consumption, while the ASIC has a long development cycle and poor flexibility, so neither meets the requirements of edge computing well. Because the FPGA not only supports parallel computing but also offers rich programmable logic resources, strong flexibility, and a short development cycle, it can well meet the hardware implementation requirements of convolutional neural networks in the field of edge
computing.

A convolutional neural network is mainly composed of convolutional layers, sub-sampling layers, and fully connected layers. Current research on FPGA implementations of CNNs concentrates on accelerating the convolutional and sub-sampling layers, with much less work on the FPGA implementation of the fully connected layers. Unlike the convolutional layer, with its local perception and weight sharing, every neuron in a fully connected layer is connected to all inputs of the previous layer, so the computation of the fully connected layer is still huge and also urgently needs hardware acceleration. This paper therefore focuses on FPGA-based hardware implementation technology for the fully connected layer of a convolutional neural network.

In this paper, the principle and structure of the CNN fully connected layer are analyzed in detail. A top-down design approach divides the whole system into several modules with relatively independent functions and structures; each module is described in Verilog HDL, and the modules can be combined as needed to build the required fully connected layer hardware structure. Among them, the configurable floating-point multiply-accumulator is one of the core modules of the whole system and a key technology of the FPGA implementation: it carries out the general and time-consuming operation of the fully connected layer, matrix multiplication. In practical applications, the floating-point multiply-accumulator can be configured by parameters to different degrees of parallelism according to the required logic resource occupancy and operation speed. This configurable design greatly enhances the flexibility of the fully connected layer in terms of area occupation and
operation speed, and achieves a balance between the two.

For the activation function, this paper evaluates candidate functions and implementation methods. Considering operation accuracy, operation speed, and logic resource occupancy, the Sigmoid function is selected and approximated by a piecewise linear function, which offers fast operation and low logic resource occupancy while keeping the error small. To change the network topology and the connections between neurons conveniently and flexibly, a storage structure describing the topology and scale of the fully connected layer is constructed; it occupies little storage space while remaining highly flexible. The Verilog HDL code for the FPGA implementation of the fully connected layer is then generated by a C++ program. For data storage during the fully connected layer computation, reading and writing the Block RAM on the FPGA with appropriate parallelism effectively balances the operation bandwidth and the storage bandwidth.

To verify the validity of the design, a testbench was written and simulated with ModelSim SE; the simulation results are consistent with expectations. To verify the timing characteristics, the TimeQuest Timing Analyzer built into Quartus II was used for timing constraints and static timing analysis, which showed that the highest operating frequency of the system can reach 106.11 MHz. Finally, the design was tested for functionality, performance, and logic resource occupancy on Altera's Cyclone IV E EP4CE115F29C7. The test results show that the design not only offers arbitrary parallelism, high precision, and a flexible connection structure, but also achieves a balance among operation speed, operation precision, and logic resource occupation, and can
meet the application requirements of edge computing.
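The configurable floating-point multiply-accumulator described above essentially performs a matrix-vector product using several parallel accumulation lanes. The following is a rough behavioural sketch of that scheme, not the thesis's Verilog design; the parameter `P` and the interleaved slicing of the dot product are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Behavioural model of a fully connected layer computed by P parallel
// multiply-accumulate lanes. Each lane accumulates an interleaved slice
// of the dot product; the partial sums are then reduced, mimicking the
// final adder stage of a hardware MAC. P models the parallelism
// parameter of the configurable floating-point multiply-accumulator.
std::vector<float> fc_forward(const std::vector<std::vector<float>>& W,
                              const std::vector<float>& x,
                              std::size_t P) {
    std::vector<float> y(W.size(), 0.0f);
    for (std::size_t n = 0; n < W.size(); ++n) {
        std::vector<float> lane(P, 0.0f);      // one accumulator per MAC lane
        for (std::size_t i = 0; i < x.size(); ++i)
            lane[i % P] += W[n][i] * x[i];     // lanes run concurrently in hardware
        for (float p : lane)                   // final reduction of partial sums
            y[n] += p;
    }
    return y;
}
```

Raising `P` trades logic resources for speed, which is the area/speed balance the abstract refers to.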
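For the piecewise linear Sigmoid approximation, the abstract does not give the segment breakpoints; a widely used scheme of this kind is the PLAN-style approximation sketched below (maximum error around 0.019), shown as an illustration of the technique rather than the thesis's exact segments.

```cpp
#include <cmath>

// Piecewise linear approximation of sigmoid(x) = 1/(1+exp(-x)).
// The positive half-axis is covered by four linear segments, and
// negative inputs use the symmetry sigmoid(-x) = 1 - sigmoid(x).
// The slope and offset constants are sums of powers of two, so in
// hardware the multiplications reduce to shifts and adds, which is
// why this style of approximation occupies few logic resources.
float sigmoid_pl(float x) {
    float ax = std::fabs(x);
    float y;
    if (ax >= 5.0f)        y = 1.0f;
    else if (ax >= 2.375f) y = 0.03125f * ax + 0.84375f;
    else if (ax >= 1.0f)   y = 0.125f   * ax + 0.625f;
    else                   y = 0.25f    * ax + 0.5f;
    return x >= 0.0f ? y : 1.0f - y;
}
```

Because no exponential or division is evaluated, such an approximation meets the speed, accuracy, and resource trade-off the abstract describes.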
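The abstract notes that the Verilog HDL for the fully connected layer is generated by a C++ program from a stored description of the topology. The fragment below is a minimal hypothetical sketch of such a generation step; the layer-size descriptor and the parameter names (`N_IN_*`, `N_OUT_*`) are assumptions for illustration, not the thesis's actual storage format or naming.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical generator: a layer-size list such as {784, 128, 10}
// describes the topology and scale of the fully connected layers, and
// one Verilog parameter line is emitted per layer. Changing the list
// changes the generated hardware structure without hand-editing HDL.
std::string emit_layer_params(const std::vector<int>& sizes) {
    std::ostringstream out;
    for (std::size_t i = 0; i + 1 < sizes.size(); ++i)
        out << "parameter N_IN_"  << i << " = " << sizes[i]
            << ", N_OUT_" << i << " = " << sizes[i + 1] << ";\n";
    return out.str();
}
```

A compact descriptor of this kind keeps the storage footprint small while letting the network scale and connectivity be reconfigured flexibly, as the abstract claims.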