
Accelerating Framework Of Transformer By Hardware Design And Model Compression Co-Optimization

Posted on: 2022-12-17
Degree: Master
Type: Thesis
Country: China
Candidate: P J Qi
Full Text: PDF
GTID: 2518306776492864
Subject: Computer Software and Application of Computer
Abstract/Summary:
Recently, Transformer-based models have shown excellent performance. However, due to their huge storage, computation, and resource consumption, they cannot be deployed effectively on embedded devices. FPGAs are widely used to accelerate deep learning algorithms because of their high energy efficiency, short development cycle, and reconfigurability. However, FPGAs have very limited on-chip memory, which presents a great challenge for model deployment. In addition, many types of FPGA hardware devices have gradually emerged, and Transformer models with different constraint requirements can be deployed on devices with different computing capabilities. Previous work gave little consideration to the choice of target device; therefore, to make the best use of the available hardware, selecting an appropriate device is also worth considering.

To address the challenges of deploying Transformer on FPGAs and selecting the best device, this paper proposes, for the first time, a hardware-software co-optimization framework that combines model compression at the software level with FPGA design at the hardware level. The framework trades off model pruning rates against hardware resources for co-optimization. The main work of this paper is as follows:

First, this paper proposes an acceleration framework for software-hardware co-optimization. The user inputs constraints (a latency constraint LC and an accuracy constraint AC), a model, and a dataset; the framework outputs a compressed model and the best device on which to deploy it.

Second, a novel hardware-friendly model pruning method, HP, is proposed within the acceleration framework. It reduces the redundant parameters of the Transformer model while preserving accuracy, thereby reducing computation and storage overhead.

Third, the sparse matrix storage format is optimized; the new format is called WMark. The sparse matrix storage format MBR is optimized by exploiting the sparsity pattern produced by the pruning method HP. The optimized format significantly reduces memory consumption and outperforms commonly used formats.

Fourth, an FPGA accelerator for the sparse model is designed. The Transformer accelerator is built on the pruning method HP and the sparse matrix storage format WMark, and solves the memory-access-conflict problem of parallel sparse matrix computation.

Experimental results show that the weight pruning method HP outperforms other pruning methods. The optimized sparse matrix storage format WMark reduces memory usage by 1.5×-2.5×. The FPGA accelerator achieves up to 37× speedup over a CPU and 1.9× speedup over a GPU. The acceleration framework can also find the best device under different constraints.
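The abstract does not detail how the framework searches for a feasible (pruning rate, device) pair. The minimal Python sketch below shows one plausible shape for such a constraint-driven search; the names (`Device`, `estimate_latency`, the candidate pruning rates, the DSP/BRAM proxies) are illustrative assumptions, not the thesis's implementation.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    dsp_slices: int   # crude proxy for compute capability
    bram_kb: int      # on-chip memory budget

def estimate_latency(flops: float, device: Device) -> float:
    """Crude latency proxy: work divided by compute capability."""
    return flops / (device.dsp_slices * 1e9)

def select(model_flops, model_bytes, accuracy_at_rate, devices, LC, AC):
    """Return the lowest-latency (latency, rate, device) triple that
    satisfies the latency constraint LC and accuracy constraint AC."""
    best = None
    for rate in (0.0, 0.25, 0.5, 0.75, 0.9):   # candidate pruning rates
        if accuracy_at_rate(rate) < AC:         # accuracy constraint AC
            continue
        flops = model_flops * (1.0 - rate)
        size = model_bytes * (1.0 - rate)
        for dev in devices:
            if size > dev.bram_kb * 1024:       # must fit in on-chip memory
                continue
            lat = estimate_latency(flops, dev)
            if lat <= LC and (best is None or lat < best[0]):
                best = (lat, rate, dev)
    return best                                  # None if no feasible point
```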
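HP's exact pruning criterion is not described in the abstract. As a point of reference, the sketch below implements generic block pruning in NumPy, a common form of hardware-friendly pruning; HP itself may differ.

```python
import numpy as np

def block_prune(weight: np.ndarray, block=(4, 4), prune_rate=0.5):
    """Zero out whole blocks with the smallest L1 norm.

    Removing aligned blocks (rather than scattered scalars) keeps the
    surviving weights in a regular pattern that maps well to fixed-width
    FPGA compute lanes.
    """
    bh, bw = block
    rows, cols = weight.shape
    assert rows % bh == 0 and cols % bw == 0
    # Reshape into a grid of blocks and score each block by its L1 norm.
    grid = weight.reshape(rows // bh, bh, cols // bw, bw)
    scores = np.abs(grid).sum(axis=(1, 3))
    # Threshold so that roughly `prune_rate` of the blocks are removed.
    k = int(scores.size * prune_rate)
    threshold = np.partition(scores.ravel(), k)[k] if k > 0 else -np.inf
    mask = (scores >= threshold)[:, None, :, None]
    return (grid * mask).reshape(rows, cols)
```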
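Likewise, the abstract does not specify WMark's layout. The sketch below encodes a block-sparse matrix with a per-block occupancy bitmap in the spirit of MBR-style formats, which suggests how such a format can cut memory relative to dense storage; the field names and packing are assumptions.

```python
import numpy as np

def pack_bitmap_blocks(weight: np.ndarray, block=(4, 4)):
    """Store only nonzero blocks: per-block bitmap + packed nonzero values.

    A dense FP32 block costs bh*bw*4 bytes; a surviving sparse block costs
    a (bh*bw)-bit bitmap plus 4 bytes per remaining nonzero, so memory
    shrinks roughly with the pruning rate.
    """
    bh, bw = block
    rows, cols = weight.shape
    bitmaps, values, block_index = [], [], []
    for i in range(0, rows, bh):
        for j in range(0, cols, bw):
            blk = weight[i:i+bh, j:j+bw]
            if not blk.any():
                continue                        # pruned block: store nothing
            bits = (blk != 0).astype(np.uint8)  # occupancy bitmap
            bitmaps.append(np.packbits(bits.ravel()))
            values.append(blk[blk != 0])        # nonzeros in row-major order
            block_index.append((i // bh, j // bw))
    return block_index, bitmaps, values
```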
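Finally, to make the parallel memory-access-conflict problem concrete, the toy scheduler below groups sparse reads into conflict-free cycles under a simple column-modulo-banks mapping. It illustrates the problem the accelerator must solve, not the thesis's actual hardware scheme.

```python
def schedule_reads(col_indices, num_banks=8):
    """Greedy scheduler: group nonzero reads into cycles so that no two
    reads in the same cycle hit the same memory bank (bank = col % banks).
    """
    pending = list(col_indices)
    cycles = []
    while pending:
        used_banks, this_cycle, rest = set(), [], []
        for col in pending:
            bank = col % num_banks
            if bank in used_banks:
                rest.append(col)           # conflict: defer to a later cycle
            else:
                used_banks.add(bank)
                this_cycle.append(col)
        cycles.append(this_cycle)
        pending = rest
    return cycles

# Example: columns 0 and 8 (and 3 and 11) map to the same bank,
# so the four reads split across two cycles.
print(schedule_reads([0, 3, 8, 11], num_banks=8))  # [[0, 3], [8, 11]]
```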
Keywords/Search Tags:Natural language processing, Transformer, FPGA, model compression, software and hardware co-optimization