In recent years, with the development of big data and artificial intelligence, real-time performance requirements for compute-intensive algorithms have become increasingly stringent. Traditional CPU platforms struggle to supply the computing power that massive real-time data processing demands. For specific domains, designing a specialized processor microarchitecture and customizing the processor core is a more energy-efficient solution. In the past, custom processors often adopted the x86 or ARM instruction set architectures, but these commercial architectures suffer from high licensing fees and limited flexibility. The emerging RISC-V architecture, with its modern design philosophy and fully open-source ecosystem, can be customized and optimized for different domains at lower cost. Among compute-intensive algorithms, the most typical is convolution: whether in image filtering, the Fourier transform, the DCT, or CNNs, the most resource-consuming operation is convolution.

This thesis designs a custom processor based on the RISC-V instruction set architecture with a general convolution acceleration function, capable of accelerating one-dimensional convolution, two-dimensional convolution with various characteristics, and neural networks. The main processor is Rocket Core, a classic 64-bit five-stage pipelined single-issue scalar processor developed under Krste Asanovic at UC Berkeley; it can be configured and generated by Rocket Chip, an open-source SoC design platform built with Chisel. The convolution acceleration unit serves as a coprocessor, connected to Rocket Core through the RoCC interface and controlled by custom extension instructions. By analyzing the principles and characteristics of one-dimensional convolution, two-dimensional convolution, and the convolution layers of neural networks, this thesis designs a highly configurable linear PE (processing element) array, which can
arbitrarily recombine PE units through software and efficiently accelerate one-dimensional and two-dimensional convolutions with multiple radii and strides. To suit different application scenarios and computing-power requirements, the number of PE units is configurable by parameter.

For accelerating the convolution layers of neural networks, three data-reuse schemes are analyzed in principle. After comparison, the scheme adopted here is: output feature maps are fully reused, while input feature maps and weights are reused as needed. Blocking the output feature map solves the problem of the overly large on-chip buffer that this reuse scheme would otherwise require. The input feature map buffer is designed as a multi-row structure, which greatly increases its bandwidth to the PE array and supports two-dimensional convolution with multiple radii; ping-pong operation of this buffer further improves the accelerator's throughput. Multiple output buffers are added so that, together with the blocking of the input feature map, intermediate feature maps are reused completely. The pooling layer is implemented by reusing the PE units, supporting both max pooling and average pooling at the cost of only a small amount of additional resources. Four custom RoCC extension instructions based on data addressing are designed, and the hardware design of the coprocessor is completed with a decoder and a controller.

The PE array, the core computation unit of the coprocessor, is verified in simulation: one-dimensional convolution and various forms of two-dimensional convolution are each simulated, and the waveforms are analyzed and explained. The Rocket Chip simulation tools are used to record the total number of cycles spent executing code, and one-dimensional convolution and the convolution layer are each evaluated. When accelerating
one-dimensional convolution, the speedup varies with the length of the one-dimensional sequence; when the two input sequences have lengths 8096 and 81, the speedup over Rocket Core at the same frequency reaches 61 times. When accelerating a neural network or two-dimensional convolution, the speedup over Rocket Core is nearly 100 times. The Verilog code is ported to the Vivado environment for synthesis; with 81 PE units configured and the coprocessor running at 200 MHz, the power consumption is 2.41 W. Simulation and synthesis verify that the processor can efficiently accelerate a variety of convolutions, with resource usage and power consumption within an acceptable range.
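For reference, the operations the PE array targets have the following semantics. This is a minimal illustrative sketch in plain Python, not the thesis's RTL; all function and variable names are ours.

```python
# Reference semantics for the convolutions the accelerator targets.
# Illustrative only; names and conventions are not from the thesis.

def conv1d(x, w):
    """Direct 1D convolution (valid mode): out[i] = sum_j x[i+j] * w[j]."""
    n, k = len(x), len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(n - k + 1)]

def conv2d(img, ker, stride=1):
    """Direct 2D convolution (valid mode) with a configurable stride,
    mirroring the multi-radius / multi-stride support described above."""
    h, w = len(img), len(img[0])
    r = len(ker)  # square kernel of side r
    out = []
    for i in range(0, h - r + 1, stride):
        row = []
        for j in range(0, w - r + 1, stride):
            row.append(sum(img[i + a][j + b] * ker[a][b]
                           for a in range(r) for b in range(r)))
        out.append(row)
    return out
```

Each output element is one multiply-accumulate chain, which is the work a PE unit (or a software-configured combination of PE units) performs in hardware.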
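The adopted data-reuse scheme (output feature maps fully reused, output blocked into tiles to bound the on-chip buffer) corresponds to an output-stationary loop nest. The sketch below shows that schedule in plain Python, as an assumed reading of the abstract rather than the actual hardware dataflow; all names are ours.

```python
def conv_layer_tiled(ifmaps, weights, tile):
    """Convolution layer with an output-stationary, output-tiled schedule:
    each output tile stays resident while every input channel's contribution
    is accumulated into it, so partial sums are fully reused and never
    written back early. Illustrative Python, not the thesis's RTL.

    ifmaps[c]        : 2D input feature map of channel c
    weights[m][c]    : r x r kernel mapping input channel c to output map m
    tile             : output tile side, bounding the on-chip buffer size
    """
    C, M = len(ifmaps), len(weights)
    r = len(weights[0][0])
    H, W = len(ifmaps[0]), len(ifmaps[0][0])
    oh, ow = H - r + 1, W - r + 1
    out = [[[0] * ow for _ in range(oh)] for _ in range(M)]
    for m in range(M):
        for ti in range(0, oh, tile):          # block the output map ...
            for tj in range(0, ow, tile):      # ... to bound on-chip buffering
                for c in range(C):             # accumulate into the resident tile
                    for i in range(ti, min(ti + tile, oh)):
                        for j in range(tj, min(tj + tile, ow)):
                            out[m][i][j] += sum(
                                ifmaps[c][i + a][j + b] * weights[m][c][a][b]
                                for a in range(r) for b in range(r))
    return out
```

The tile size changes only the schedule, not the result, which is why blocking can shrink the output buffer without affecting correctness.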
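The pooling modes supported by the reused PE units have equally simple semantics. A minimal sketch of the two modes (again illustrative Python with our own names, not the hardware implementation):

```python
def pool2d(fmap, size, stride, mode="max"):
    """Max or average pooling over strided windows, the two modes the
    reused PE units support. Illustrative only; not the thesis's RTL."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            win = [fmap[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(win) if mode == "max" else sum(win) / len(win))
        out.append(row)
    return out
```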