
Compression And Acceleration Method Of Neural Network For Lightweight And High Energy Efficiency

Posted on: 2024-07-07    Degree: Master    Type: Thesis
Country: China    Candidate: B S Liang    Full Text: PDF
GTID: 2568306914465694    Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, deep neural network technology has developed rapidly and achieved remarkable results in both research and application. As research and deployment have deepened, the scale of deep neural networks has kept growing in pursuit of better accuracy and generalization. The resulting surge in computation and parameter count poses challenges for deploying deep neural networks, such as higher energy consumption, a limited range of deployable devices, and difficulty meeting real-time requirements. How to compress neural network models, accelerate inference, and reduce inference energy consumption is therefore a hot issue in both academia and industry. LSTM, a deep neural network for processing sequential data, is widely used in natural language processing, speech recognition, and other fields. In this paper, we take LSTM as the object of model compression and inference acceleration, aiming to achieve high-performance, energy-efficient inference by making the network lightweight and building a heterogeneous computing platform.

For LSTM weight pruning, this paper introduces a knowledge distillation-based strategy for recovering the accuracy of pruned models. Building on this strategy, two compression methods are proposed: one that distills from the original model and one that distills from a BERT model. Experimental results show that, for both fine-grained and coarse-grained pruning at the same sparsity, the compressed model accuracy is higher than that obtained by fine-tuning. In the experiments, coarse-grained pruning, the accuracy recovery strategy, and quantization are combined to compress a BiLSTM. The compressed model achieves a 2.3x speedup and a 2.8x energy efficiency improvement over the original model on the GPU platform without any loss of accuracy.

To implement high-performance LSTM inference on FPGA, this paper introduces row-balanced pruning into LSTM compression and designs a storage format, CBSR, together with a matrix-vector multiplication kernel tailored to the row-balanced sparsity. The LSTM accelerator is further optimized with fixed-point quantization and reconstructed activation functions. Experimental results show that these optimizations effectively reduce FPGA resource utilization and LSTM inference time. On this basis, the LSTM accelerator is combined with a CPU to build a heterogeneous computing platform that provides computational support for LSTM-based algorithms. Experimental results show that, without losing model accuracy, the platform achieves a 2.07x speedup and 7.6x energy efficiency improvement over the GPU, and a 7.5x speedup and 17.7x energy efficiency improvement over the CPU. These results also validate the rationality of building such a heterogeneous computing platform.
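To make the accuracy-recovery idea concrete, the following is a minimal sketch of a knowledge-distillation loss in which the dense (unpruned) LSTM acts as the teacher and the pruned LSTM as the student. The thesis does not spell out the exact loss formulation; the PyTorch-style setup and names such as distillation_loss, temperature, and alpha are illustrative assumptions, not the author's implementation.

    # Hedged sketch: knowledge-distillation loss for recovering the accuracy
    # of a pruned LSTM. Teacher = dense model, student = pruned model.
    # The temperature and alpha values are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=4.0, alpha=0.5):
        # Soft-target term: match the student's softened output distribution
        # to the teacher's softened output distribution.
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
        # Hard-target term: ordinary cross-entropy against ground-truth labels.
        hard_loss = F.cross_entropy(student_logits, labels)
        # Weighted combination of the two terms.
        return alpha * soft_loss + (1.0 - alpha) * hard_loss

During pruned-model retraining, this loss would replace the plain cross-entropy used in fine-tuning, which is what the comparison "higher accuracy than fine-tuning at the same sparsity" refers to.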
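Row-balanced pruning keeps the same number of nonzero weights in every row of a weight matrix, which is what allows a compact, regular storage format and a simple matrix-vector kernel on FPGA. Below is a minimal software sketch of such a kernel under that assumption; the exact CBSR layout used in the thesis is not described here, so the two-array layout (per-row values and column indices) is only an illustrative stand-in.

    # Hedged sketch: matrix-vector product over a row-balanced sparse matrix.
    # Each row keeps exactly k nonzeros, so values and col_idx are dense
    # (rows, k) arrays; this layout is an assumed stand-in for CBSR.
    import numpy as np

    def row_balanced_spmv(values, col_idx, x):
        rows, k = values.shape
        y = np.zeros(rows, dtype=values.dtype)
        for r in range(rows):
            # Every row does the same k multiply-accumulates, which maps
            # naturally onto a fixed number of parallel lanes in hardware.
            for j in range(k):
                y[r] += values[r, j] * x[col_idx[r, j]]
        return y

The regular per-row workload is the point of row balancing: unlike general CSR, no row finishes early, so hardware lanes stay evenly utilized.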
Keywords/Search Tags: LSTM, knowledge distillation, weight pruning, FPGA, heterogeneous computing