Recurrent neural networks (RNNs) have been widely used in automatic speech recognition (ASR) and natural language processing (NLP). To accelerate RNN inference, previous works have proposed various optimization methods. Weight pruning is a widely used strategy that speeds up RNNs by constructing sparse weight matrices and omitting the computation and storage of zero elements during inference. However, the non-zero elements that remain after unstructured pruning are randomly distributed, which leads to unbalanced computation and memory conflicts and thus fails to reach the ideal speed-up ratio. Bank-Balanced Sparsity (BBS) is an efficient compression method with a balanced distribution of non-zero elements and negligible accuracy degradation, but it incurs considerable additional memory overhead to store the indices, which limits the compression ratio. Other works exploit the input similarity of time-series tasks to reduce computation, but their similarity-check algorithms suffer from either high complexity or large error accumulation, and they do not exploit weight sparsity at the same time, leaving considerable room for optimization.

This paper presents an acceleration scheme for recurrent neural networks that combines balanced sparsity with input-similarity-based skipping. For the weights, a Shared Index Bank-Balanced Sparsity (SIBBS) compression method is proposed. The rows of a weight matrix are divided into multiple bank clusters to balance the distribution of non-zero weights, and the banks within a cluster share their indices. Compared with BBS, the index cost is reduced by 2-8x, while accuracy decreases by only 0.9% on LibriSpeech and 0.4% on TIMIT.

For the inputs, a coarse-grained skipping algorithm, the fixed input similarity-based skipping algorithm, is proposed to exploit the balanced pruning of SIBBS. It compares the similarity between the current input and the first input after a skipping failure, which accumulates less error than algorithms based on the similarity of adjacent inputs. In addition, the similarity formula proposed in this paper matches the precision of existing formulas while greatly reducing the computational complexity. When this algorithm reduces LSTM operations by 10%, accuracy decreases by 0.55%-1.90% on the LibriSpeech test set and 0.42%-0.88% on the TIMIT test set, with negligible computational overhead.

Finally, an accelerator architecture that applies SIBBS and the fixed input similarity-based skipping algorithm is proposed. The accelerator includes a sparse matrix-vector multiplication unit and a similarity-check unit to execute the two algorithms, reducing computation from both the weight-matrix and input-vector sides. The accelerator is implemented on a Xilinx XCKU115 FPGA. Compared with state-of-the-art FPGA-based LSTM accelerators, it achieves a 1.47x-79.5x reduction in latency without accuracy loss. When performing continuous LSTM computations, the average latency is further reduced by the input similarity-based skipping algorithm.
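The following is a minimal sketch of the index-sharing idea behind SIBBS. The bank size, cluster size, number of kept weights, and the magnitude-sum selection rule used below are illustrative assumptions, not the paper's exact configuration; the point is only that all rows of a cluster keep non-zeros at the same column positions within each bank, so one index list serves the whole cluster.

```python
import numpy as np

def sibbs_prune(W, bank_size=8, cluster_rows=4, k=2):
    """Prune W so that, within each bank of columns, every row of a cluster
    keeps non-zeros at the same k column offsets (shared index).
    Returns the pruned matrix and the shared index table."""
    rows, cols = W.shape
    assert rows % cluster_rows == 0 and cols % bank_size == 0
    W_pruned = np.zeros_like(W)
    # shared_idx[c][b]: the k column offsets kept by cluster c in bank b
    shared_idx = [[None] * (cols // bank_size) for _ in range(rows // cluster_rows)]

    for c in range(rows // cluster_rows):
        r0, r1 = c * cluster_rows, (c + 1) * cluster_rows
        for b in range(cols // bank_size):
            c0 = b * bank_size
            bank = W[r0:r1, c0:c0 + bank_size]
            # Illustrative selection rule: keep the k columns with the
            # largest summed magnitude over the whole cluster.
            keep = np.sort(np.argsort(np.abs(bank).sum(axis=0))[-k:])
            shared_idx[c][b] = keep
            W_pruned[r0:r1, c0 + keep] = bank[:, keep]
    return W_pruned, shared_idx
```

Because every row in a cluster reuses the same index list, the index storage per bank shrinks from one list per row (as in BBS) to one list per cluster, which is the source of the index-cost reduction reported above.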
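A minimal sketch of the fixed input similarity-based skipping idea is shown below. The paper's own similarity formula and threshold are not reproduced here, so the relative L1 distance and the threshold value are stand-in assumptions, and the recurrent state handling is omitted for clarity; the point illustrated is that inputs are compared against a fixed reference (the first input after the last skipping failure) rather than against the immediately preceding input, which limits error accumulation.

```python
import numpy as np

def run_with_fixed_reference_skipping(inputs, lstm_step, threshold=0.05):
    """Process a sequence of input vectors, skipping the LSTM step whenever
    the current input is similar enough to a fixed reference input.
    The reference is updated only on a skipping failure, i.e. when the
    check decides the full computation must be run."""
    outputs = []
    reference_x = None   # first input after the last skipping failure
    reference_y = None   # its LSTM output, reused when a step is skipped
    for x in inputs:
        if reference_x is not None:
            # Stand-in similarity measure: relative L1 distance to the
            # fixed reference (not the paper's actual formula).
            diff = np.abs(x - reference_x).sum() / (np.abs(reference_x).sum() + 1e-8)
            if diff < threshold:
                outputs.append(reference_y)   # skip: reuse the reference output
                continue
        # Skipping failure: run the full LSTM step and reset the reference.
        y = lstm_step(x)
        reference_x, reference_y = x, y
        outputs.append(y)
    return outputs
```

Comparing against a fixed reference means the approximation error of each skipped step is bounded by a single comparison to a fully computed input, whereas adjacent-input schemes let small per-step differences accumulate across a run of skips.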