
Research On Continuous Speech Recognition System Based On Transformer

Posted on: 2022-08-04    Degree: Master    Type: Thesis
Country: China    Candidate: F Jian    Full Text: PDF
GTID: 2518306575964139    Subject: Electronic Science and Technology
Abstract/Summary:
Automatic speech recognition converts speech signals into text. In the deep learning era, acoustic models based on the Deep Feed-forward Sequential Memory Network (DFSMN) have gradually replaced the Bidirectional Long Short-Term Memory network (BLSTM), while the Transformer has become the mainstream model in Natural Language Processing (NLP). This thesis uses DFSMN as the acoustic model and introduces the Transformer to recast speech recognition as a translation task, a study of both theoretical significance and research value.

First, this thesis reviews several mainstream deep learning models in the field of speech recognition and, based on deep learning theory, designs the overall scheme of a Transformer-based continuous speech recognition system. The main contributions are analysing the shortcomings of existing speech feature extraction methods, introducing the Transformer as the language model, and optimizing both the feature extraction method and the language model.

Second, because Mel Frequency Cepstral Coefficient (MFCC) features represent speech information weakly in deep models, a feature re-extraction method combining log Mel Filter-bank (Fbank) features with a Convolutional Neural Network (CNN) is proposed, and the CNN is combined with DFSMN into a CNN-DFSMN acoustic model for the speech transcription task. Experimental results show that, compared with other feature extraction methods, the Fbank-plus-CNN re-extraction method captures speech information better and yields a lower Character Error Rate (CER).

Then, an improved attention computation based on a Hadamard matrix is proposed to address the high computational cost of the Transformer and the low recognition rate caused by the model's insufficient generalization ability. The approach computes a new attention matrix as the element-wise (Hadamard) product of the original attention matrix and a mask matrix obtained by setting different thresholds. Experimental results show that the improved Transformer reduces both the recognition time and the CER of the language model compared with the original Transformer.

Finally, the CNN-DFSMN structure is used as the acoustic model and the improved Transformer as the language model to build the CNN-DFSMN-T speech recognition system, and Connectionist Temporal Classification (CTC) is introduced to build the CNN-DFSMN-CTC end-to-end system. Experiments and comparative analyses are conducted on four Chinese corpora, including Aidatatang and Magicdata. The results show that the CER of the CNN-DFSMN-T system is 11.8%, which is 3.2% lower than that of the DFSMN-3gram system, and the CER of the CNN-DFSMN-CTC system is 17%, which is 2.2% lower than that of the CNN-BLSTM-CTC system. These results verify the feasibility of the continuous speech recognition system designed in this thesis.
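A minimal sketch of the Fbank-plus-CNN feature re-extraction idea described above, assuming a PyTorch/torchaudio front end; the layer sizes, channel counts, and class name are illustrative assumptions, not taken from the thesis.

```python
# Hypothetical sketch: log-Mel filterbank (Fbank) features re-extracted by a small CNN
# before being fed to the acoustic model. All hyperparameters are illustrative.
import torch
import torch.nn as nn
import torchaudio

class CNNFeatureReExtractor(nn.Module):
    """Re-extracts frame-level features from Fbank input with a 2-D CNN."""
    def __init__(self, out_channels: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, fbank: torch.Tensor) -> torch.Tensor:
        # fbank: (batch, time, n_mels) -> add a channel dimension for the 2-D CNN
        x = self.cnn(fbank.unsqueeze(1))              # (batch, C, time/4, n_mels/4)
        b, c, t, f = x.shape
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # flattened frame features

# Fbank front end: log-Mel spectrogram of a 1 s dummy waveform
fbank_transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
waveform = torch.randn(1, 16000)
fbank = torch.log(fbank_transform(waveform) + 1e-6).transpose(1, 2)  # (1, frames, 80)
features = CNNFeatureReExtractor()(fbank)  # re-extracted features for the acoustic model
```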
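The following is one possible reading of the Hadamard-matrix attention improvement summarized above: a 0/1 mask is obtained by thresholding the attention weights and then applied element-wise (a Hadamard product). The threshold value and the row renormalization are assumptions for illustration only.

```python
# Hypothetical sketch of thresholded (Hadamard-product) attention; not the thesis code.
import torch
import torch.nn.functional as F

def thresholded_attention(q, k, v, threshold: float = 0.01):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    attn = F.softmax(scores, dim=-1)              # standard attention weights
    mask = (attn >= threshold).to(attn.dtype)     # 0/1 mask from a threshold (assumed)
    attn = attn * mask                            # element-wise (Hadamard) product
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize rows
    return torch.matmul(attn, v)

q = k = v = torch.randn(2, 4, 10, 64)
out = thresholded_attention(q, k, v)              # (2, 4, 10, 64)
```

Zeroing small attention weights in this way sparsifies the attention matrix, which is one plausible route to the reduced recognition time reported for the improved Transformer.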
Keywords/Search Tags:speech recognition, CNN, DFSMN, transformer, end-to-end