
Accelerating Framework Of Transformer By Hardware Design And Model Compression Co-Optimization

Posted on: 2022-12-17
Degree: Master
Type: Thesis
Country: China
Candidate: P J Qi
Full Text: PDF
GTID: 2518306776492864
Subject: Computer Software and Application of Computer
Abstract/Summary:
Recently, Transformer-based models have shown excellent performance. However, due to their huge storage, computation, and resource consumption, they cannot be deployed effectively on embedded devices. FPGAs are widely used to accelerate deep learning algorithms because of their high energy efficiency, short development cycle, and reconfigurability. However, FPGAs have very limited on-chip memory, which presents a great challenge for model deployment. In addition, many types of FPGA hardware devices have gradually emerged, and Transformer models with different constraint requirements can be deployed on devices with different computing capabilities. Previous work gave little consideration to the choice of target device; therefore, to make the best use of the available hardware, selecting an appropriate device is also worth considering.

To address the challenges of deploying Transformer on FPGAs and selecting the best device, this paper proposes, for the first time, a hardware-software co-optimization framework that combines model compression at the software level with FPGA design at the hardware level. The framework trades off model pruning rates against hardware resources for co-optimization. The main work of this paper is as follows:

First, this paper proposes an acceleration framework for software-hardware co-optimization. The user inputs constraints (a latency constraint LC and an accuracy constraint AC), a model, and a dataset; the framework outputs a compressed model and the best device on which to deploy it.

Second, a novel hardware-friendly model pruning method, HP, is proposed within the acceleration framework. It reduces the redundant parameters of the Transformer model while preserving accuracy, thereby reducing computation and storage overhead.

Third, the sparse matrix storage format is optimized; the new format is called WMark. The sparse matrix storage format MBR is optimized by exploiting the sparsity pattern produced by the pruning method HP. The optimized format significantly reduces memory consumption and outperforms commonly used formats.

Fourth, an FPGA accelerator for the sparse model is designed. The Transformer accelerator is built on the pruning method HP and the sparse matrix storage format WMark, and solves the memory-access-conflict problem of parallel sparse matrix computation.

Experimental results show that the weight pruning method HP outperforms other pruning methods. The optimized sparse matrix storage format WMark reduces memory usage by 1.5×-2.5×. The FPGA accelerator achieves up to 37× speedup over a CPU and 1.9× speedup over a GPU. The acceleration framework can also find the best device under different constraints.
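The abstract does not detail how the framework searches for a feasible (pruning rate, device) pair. The minimal Python sketch below shows one plausible shape for such a constraint-driven search; the names (`Device`, `estimate_latency`, the candidate pruning rates, the DSP/BRAM proxies) are illustrative assumptions, not the thesis's implementation.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    dsp_slices: int   # crude proxy for compute capability
    bram_kb: int      # on-chip memory budget

def estimate_latency(flops: float, device: Device) -> float:
    """Crude latency proxy: work divided by compute capability."""
    return flops / (device.dsp_slices * 1e9)

def select(model_flops, model_bytes, accuracy_at_rate, devices, LC, AC):
    """Return the lowest-latency (latency, rate, device) triple that
    satisfies the latency constraint LC and accuracy constraint AC."""
    best = None
    for rate in (0.0, 0.25, 0.5, 0.75, 0.9):   # candidate pruning rates
        if accuracy_at_rate(rate) < AC:         # accuracy constraint AC
            continue
        flops = model_flops * (1.0 - rate)
        size = model_bytes * (1.0 - rate)
        for dev in devices:
            if size > dev.bram_kb * 1024:       # must fit in on-chip memory
                continue
            lat = estimate_latency(flops, dev)
            if lat <= LC and (best is None or lat < best[0]):
                best = (lat, rate, dev)
    return best                                  # None if no feasible point
```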
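HP's exact pruning criterion is not described in the abstract. As a point of reference, the sketch below implements generic block pruning in NumPy, a common form of hardware-friendly pruning; HP itself may differ.

```python
import numpy as np

def block_prune(weight: np.ndarray, block=(4, 4), prune_rate=0.5):
    """Zero out whole blocks with the smallest L1 norm.

    Removing aligned blocks (rather than scattered scalars) keeps the
    surviving weights in a regular pattern that maps well to fixed-width
    FPGA compute lanes.
    """
    bh, bw = block
    rows, cols = weight.shape
    assert rows % bh == 0 and cols % bw == 0
    # Reshape into a grid of blocks and score each block by its L1 norm.
    grid = weight.reshape(rows // bh, bh, cols // bw, bw)
    scores = np.abs(grid).sum(axis=(1, 3))
    # Threshold so that roughly `prune_rate` of the blocks are removed.
    k = int(scores.size * prune_rate)
    threshold = np.partition(scores.ravel(), k)[k] if k > 0 else -np.inf
    mask = (scores >= threshold)[:, None, :, None]
    return (grid * mask).reshape(rows, cols)
```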
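Likewise, the abstract does not specify WMark's layout. The sketch below encodes a block-sparse matrix with a per-block occupancy bitmap in the spirit of MBR-style formats, which suggests how such a format can cut memory relative to dense storage; the field names and packing are assumptions.

```python
import numpy as np

def pack_bitmap_blocks(weight: np.ndarray, block=(4, 4)):
    """Store only nonzero blocks: per-block bitmap + packed nonzero values.

    A dense FP32 block costs bh*bw*4 bytes; a surviving sparse block costs
    a (bh*bw)-bit bitmap plus 4 bytes per remaining nonzero, so memory
    shrinks roughly with the pruning rate.
    """
    bh, bw = block
    rows, cols = weight.shape
    bitmaps, values, block_index = [], [], []
    for i in range(0, rows, bh):
        for j in range(0, cols, bw):
            blk = weight[i:i+bh, j:j+bw]
            if not blk.any():
                continue                        # pruned block: store nothing
            bits = (blk != 0).astype(np.uint8)  # occupancy bitmap
            bitmaps.append(np.packbits(bits.ravel()))
            values.append(blk[blk != 0])        # nonzeros in row-major order
            block_index.append((i // bh, j // bw))
    return block_index, bitmaps, values
```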
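Finally, to make the parallel memory-access-conflict problem concrete, the toy scheduler below groups sparse reads into conflict-free cycles under a simple column-modulo-banks mapping. It illustrates the problem the accelerator must solve, not the thesis's actual hardware scheme.

```python
def schedule_reads(col_indices, num_banks=8):
    """Greedy scheduler: group nonzero reads into cycles so that no two
    reads in the same cycle hit the same memory bank (bank = col % banks).
    """
    pending = list(col_indices)
    cycles = []
    while pending:
        used_banks, this_cycle, rest = set(), [], []
        for col in pending:
            bank = col % num_banks
            if bank in used_banks:
                rest.append(col)           # conflict: defer to a later cycle
            else:
                used_banks.add(bank)
                this_cycle.append(col)
        cycles.append(this_cycle)
        pending = rest
    return cycles

# Example: columns 0 and 8 (and 3 and 11) map to the same bank,
# so the four reads split across two cycles.
print(schedule_reads([0, 3, 8, 11], num_banks=8))  # [[0, 3], [8, 11]]
```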
Keywords/Search Tags:Natural language processing, Transformer, FPGA, model compression, software and hardware co-optimization