With the widespread application of deep learning in fields such as image, video, and speech processing, neural network algorithms are evolving rapidly, and the computational and memory-access demands of models keep increasing. Choosing an appropriate computing platform for each new algorithm is therefore particularly important. Besides CPUs and GPUs, the reconfigurable FPGA is gradually becoming an excellent computing platform for balancing power and performance. Owing to its reconfigurability, an FPGA can be customized to the computing process of a given application, achieving high parallelism and efficiency. Moreover, with advances in semiconductor technology, the hardware resources integrated on FPGAs continue to grow: the latest large FPGAs use silicon interposers to integrate multiple dies, packaging more logic on a single device to satisfy the diverse computing and memory-access requirements of complex applications. However, because neural networks iterate rapidly, the computational architecture and operating modes of the hardware units still need in-depth exploration before new or not-yet-fully-exploited parallel algorithms can achieve excellent computing performance. Beyond the design phase, how accelerator units are deployed on large FPGAs is equally essential: critical issues such as floorplanning and cross-die communication lead to suboptimal timing, prevent designs from reaching high clock frequencies, and thus limit actual performance. To analyze applications, design accelerator architectures, and optimize accelerator frequency, architects must master specialized knowledge of large FPGA hardware.

In this dissertation, we conduct an in-depth study of customization methods for neural network accelerators on large-scale reconfigurable hardware, exploring accelerator performance from three perspectives: performance analysis, architecture design, and deployment optimization. First, based on the characteristics of large FPGAs, we analyze the relationships among the parallelism, buffer resources, and bandwidth requirements of the computing kernels. Second, we propose an efficient accelerator design paradigm that partitions the network model, determines data reuse and pipeline allocation, and maximizes the utilization of on-chip resources. Third, during accelerator deployment, we reduce local on-chip congestion, lower the worst negative slack, and improve frequency by combining placement-stage optimization with resource-distribution and cross-die constraints. Together, these three lines of research comprehensively explore the design of efficient neural network computing systems. The trade-off underlying the first perspective is sketched below.
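As a minimal illustration of this kind of analysis, the sketch below models a double-buffered (dual-pipeline) kernel in which off-chip transfers overlap with computation: the steady-state tile time is the maximum of the compute and transfer times, so the slower side identifies the bottleneck, and ping-pong buffering fixes the on-chip buffer cost. The KernelConfig fields, the GEMM tile example, and the cost model are illustrative assumptions, not the dissertation's actual DoubleFlow formulation.

```python
from dataclasses import dataclass

@dataclass
class KernelConfig:
    macs: int              # multiply-accumulate ops per tile
    bytes_in: int          # off-chip bytes loaded per tile
    bytes_out: int         # off-chip bytes stored per tile
    parallelism: int       # MACs per cycle delivered by the PE array
    freq_hz: float         # accelerator clock frequency
    bw_bytes_per_s: float  # effective off-chip bandwidth

def dual_pipeline_tile_time(cfg: KernelConfig) -> dict:
    """Steady-state per-tile latency when computation and off-chip
    transfers run as two overlapped pipelines (double buffering):
    the tile time is the max of the two stages, and the slower
    stage is the design bottleneck."""
    t_compute = cfg.macs / (cfg.parallelism * cfg.freq_hz)
    t_memory = (cfg.bytes_in + cfg.bytes_out) / cfg.bw_bytes_per_s
    # Ping-pong buffering keeps two tiles of each stream on chip.
    buffer_bytes = 2 * (cfg.bytes_in + cfg.bytes_out)
    return {"tile_time_s": max(t_compute, t_memory),
            "bottleneck": "compute" if t_compute >= t_memory else "bandwidth",
            "buffer_bytes": buffer_bytes}

# Example: a 64x64x64 GEMM tile (16-bit data) on a 512-MAC array
# running at 300 MHz against 19.2 GB/s of off-chip bandwidth.
cfg = KernelConfig(macs=64 ** 3,
                   bytes_in=2 * 64 * 64 * 2, bytes_out=64 * 64 * 2,
                   parallelism=512, freq_hz=300e6, bw_bytes_per_s=19.2e9)
print(dual_pipeline_tile_time(cfg))
```

In such a model, raising parallelism only helps while the kernel stays compute-bound; past that point, extra bandwidth or larger tiles (and hence larger buffers) are needed, which is precisely the three-way relationship the first perspective analyzes.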
The main contributions of this dissertation are summarized as follows:

1. To support the design of large-scale FPGA-based neural network accelerators, this dissertation proposes DoubleFlow, a performance analysis method that effectively determines the relationships between different hardware design elements and identifies accelerator design bottlenecks. Specifically, the method relates off-chip bandwidth, on-chip buffering, and parallel computing under the generic dual-pipeline execution form of FPGA accelerators. DoubleFlow then uses the polyhedral model to transform the application's computation schedule into this dual-pipeline form and models the entire computation and memory-access behavior against the characteristics of large-scale FPGA hardware, analyzing the performance differences and resource costs of different design choices.

2. This dissertation proposes the Multi-Clusters design paradigm for FPGA-based neural network accelerators. The paradigm combines the Overlap and Stream design styles, splitting the computation process of the network model and generating matched processing engines for its parts. Internally, each engine applies pipeline optimization to improve computation performance and balances its workload through design-space search to maximize computational efficiency. Externally, the design uses software scheduling and a coarse-grained pipeline to minimize middleware scheduling overhead.

3. This dissertation proposes FrqBooster, a coarse-grained floorplanning method for boosting frequency that exploits the characteristics of large-scale multi-die FPGAs. The method analyzes the connection relationships among the components of a hardware design, searches for a suitable component distribution with a planning algorithm, and mitigates the frequency degradation caused by cross-die communication (a simplified sketch of this floorplanning search appears after this summary). It then analyzes the resource distribution of the components in the IP-core design, combines it with the resource availability of the target FPGA, and performs a two-stage resource-balancing optimization to alleviate local congestion. Finally, it optimizes the overall placement to reduce worst-case path delays, improving the final frequency results.

4. Building on the above performance analysis and architecture design work, we designed ViA, an FPGA-based accelerator for the recent Vision Transformer algorithm, using the Multi-Clusters design paradigm; it effectively improves the algorithm's computational performance. First, based on a data-locality analysis, we design an efficient data-partitioning strategy that reduces the locality penalties on image data and improves the efficiency of computation and memory access. Second, to address path dependences, ViA uses half-layer mapping and throughput analysis to reduce the dependence chains introduced by the residual mechanism, saving on-chip pipeline resources and improving overall computational efficiency.

In summary, this dissertation comprehensively explores optimization methods for accelerator design from three aspects: hardware design-space analysis and computation scheduling, accelerator architecture design, and frequency optimization during deployment; furthermore, it proposes a novel hardware accelerator customized for a new neural network algorithm. The DoubleFlow performance analysis method, the Multi-Clusters design paradigm, the FrqBooster floorplanning optimization, and the Vision Transformer accelerator ViA are of significant research and practical value for neural network accelerators on large-scale reconfigurable hardware.
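To make the floorplanning step referenced in contribution 3 concrete, the toy sketch below treats components and their connections as a weighted graph and assigns components to dies so that heavily connected components share a die, subject to per-die resource capacity. All component names, connection weights, and capacities are invented for illustration; FrqBooster's actual planning algorithm and cost model may differ.

```python
import itertools

# Hypothetical inputs (illustrative, not from the dissertation):
# per-component resource demand and pairwise connection weights.
components = {"dma": 8, "pe_array": 40, "buffer_ctl": 12, "post_proc": 10}
edges = {("dma", "pe_array"): 120, ("pe_array", "buffer_ctl"): 300,
         ("buffer_ctl", "post_proc"): 80, ("dma", "post_proc"): 10}
DIES, CAPACITY = 2, 60  # two dies (SLRs), resource units per die

def cross_die_cost(assign):
    """Total weight of connections whose endpoints sit on different
    dies; each crossing consumes inter-die wires and hurts timing."""
    return sum(w for (a, b), w in edges.items() if assign[a] != assign[b])

def fits(assign):
    """Check that the components assigned to each die fit its capacity."""
    return all(
        sum(r for c, r in components.items() if assign[c] == d) <= CAPACITY
        for d in range(DIES))

# Exhaustive search works at this coarse granularity; larger
# component graphs would need a partitioning heuristic instead.
names = list(components)
feasible = (dict(zip(names, a))
            for a in itertools.product(range(DIES), repeat=len(names))
            if fits(dict(zip(names, a))))
best = min(feasible, key=cross_die_cost)
print(best, "cross-die weight:", cross_die_cost(best))
```

Because the coarse-grained component count is small, even this brute-force search is tractable; the value of such a step is that it fixes die assignments before placement, so the later resource-balancing and placement optimization stages only need to resolve congestion within each die.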