With the continuous development of machine learning, it has become one of the most important methods for solving problems in many research fields. Attention-based machine learning algorithms are an emerging class of such algorithms. In recent years, they have made considerable progress in routine tasks such as intelligent optimization, text recognition, and image processing. However, most attention-based machine learning algorithms are implemented in high-level programming languages and deployed directly on PCs or server clusters. This neglects the parallel data relationships within the algorithms, which reduces their computational performance and prevents them from being used effectively at the edge. Deploying attention-based machine learning algorithms on edge terminals, while preserving their ability to complete their tasks, is therefore an important research topic.

The current mainstream hardware devices are CPUs and GPUs. They are not well suited to deploying machine learning algorithms at the edge, owing to the low computational efficiency of their basic instructions or to their high power consumption. The FPGA, a device with a complete hardware development ecosystem, offers low power consumption, reconfigurability, customizability, and a short overall development cycle; a highly optimized FPGA design can be deeply pipelined to achieve extremely high data throughput. This thesis therefore uses a hardware description language to study and implement on FPGAs two typical attention-based machine learning algorithms for edge deployment, oriented respectively toward intelligent optimization and natural language processing.

This thesis first studies and implements an FPGA hardware architecture for Beetle Antennae Search (BAS), a meta-heuristic algorithm from the field of intelligent optimization. First, the thesis analyzes the algorithm structure of BAS and derives a corresponding hardware architecture. Second, it establishes a twin-LFSR model in the FPGA, which solves the problem that heuristic algorithms cannot otherwise handle high-dimensional functions in a hardware environment. Finally, it uses a hardware description language to implement the BAS hardware architecture, for the first time, on a Xilinx XC7Z010 FPGA chip and discusses the impact of different operating frequencies on algorithm performance. The architecture optimizes the Booth benchmark function on the FPGA with a latency of 132.5 µs, far faster than executing high-level-language BAS code on a computer platform. The experimental results verify the correctness of the architecture.
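For illustration, a minimal Verilog sketch of one plausible reading of the twin-LFSR model (two independently seeded maximal-length LFSRs producing uncorrelated pseudo-random streams) is given below. The register width, feedback taps, and seeds are assumptions made for the example; the thesis's exact design is not specified in this abstract.

    // Two 16-bit maximal-length Fibonacci LFSRs (polynomial
    // x^16 + x^15 + x^13 + x^4 + 1). Width, taps, and seeds are
    // illustrative assumptions, not the thesis's actual parameters.
    module twin_lfsr (
        input  wire        clk,
        input  wire        rst_n,
        output reg  [15:0] rnd_a,   // first pseudo-random stream
        output reg  [15:0] rnd_b    // second, independently seeded stream
    );
        wire fb_a = rnd_a[15] ^ rnd_a[14] ^ rnd_a[12] ^ rnd_a[3];
        wire fb_b = rnd_b[15] ^ rnd_b[14] ^ rnd_b[12] ^ rnd_b[3];

        always @(posedge clk or negedge rst_n) begin
            if (!rst_n) begin
                rnd_a <= 16'hACE1;   // arbitrary non-zero seeds
                rnd_b <= 16'h1D2C;
            end else begin
                rnd_a <= {rnd_a[14:0], fb_a};   // shift in feedback bit
                rnd_b <= {rnd_b[14:0], fb_b};
            end
        end
    endmodule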
Next, this thesis studies and designs a customized hardware architecture for the Transformer neural network model from the field of natural language processing. This part first introduces the algorithm structure of the Transformer. It then presents a software-hardware co-design process based on that structure and on the specific target FPGA platform. In this process, the thesis discusses in detail the attention-mechanism operations and the corresponding parallel data relationships in each Transformer sub-module. Combining these data relationships, it analyzes the maximum parallelism of the attention-mechanism computation and establishes a complete mathematical model, based on the specific hardware resources onto which the algorithm is mapped, to find the maximum parallelism of the overall hardware architecture. In addition, the thesis proposes a multitask storage strategy to reduce the impact of the on-chip/off-chip data-transfer bandwidth bottleneck on computing performance.

This thesis uses Verilog HDL to design a Transformer acceleration architecture based on a Xilinx ZU9EG FPGA. The architecture comprises two parts, hardware and software. The hardware part is mainly responsible for accelerating computation; it includes a large-scale matrix multiplication module, a layer normalization module, a large-scale matrix addition module, and a feed-forward network module. The software part is mainly responsible for data scheduling. On the WMT16 EN-DE dataset, the experimental results show that at an operating frequency of 200 MHz the overall latency is 90.201 ms and the power consumption is 12.26 W. This corresponds to 13.67 times and 8.9 times the performance of a general-purpose CPU and GPU, respectively, at only about 1/11 and 1/24 of their respective power consumption.
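As an illustration of the kind of building block from which such a large-scale matrix multiplication module is commonly assembled, a minimal pipelined multiply-accumulate processing element is sketched below. The operand width, signedness, and two-stage pipelining are assumptions made for the example; the thesis's actual module interface is not given in this abstract.

    // One multiply-accumulate (MAC) processing element: an array of these
    // computes the dot products of a matrix multiplication. All parameters
    // are illustrative assumptions.
    module mac_pe #(
        parameter W = 16                  // operand width (assumed)
    ) (
        input  wire                  clk,
        input  wire                  rst_n,
        input  wire                  clear,  // start a new dot product
        input  wire signed [W-1:0]   a,      // activation operand
        input  wire signed [W-1:0]   b,      // weight operand
        output reg  signed [2*W+7:0] acc     // partial sum, with headroom
    );
        // Stage 1: register the product; pipelining the multiplier helps
        // the design close timing at a higher clock frequency.
        reg signed [2*W-1:0] prod;
        reg                  clear_d;        // align clear with its product
        always @(posedge clk) begin
            prod    <= a * b;
            clear_d <= clear;
        end

        // Stage 2: accumulate; the 8 extra bits tolerate up to 256 terms
        // (an assumed dot-product length) without overflow.
        always @(posedge clk or negedge rst_n) begin
            if (!rst_n)       acc <= 0;
            else if (clear_d) acc <= prod;
            else              acc <= acc + prod;
        end
    endmodule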