
Research On End-to-End Speech Recognition Method Based On Self-Attention Mechanism

Posted on: 2021-06-30
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Lei
Full Text: PDF
GTID: 2518306497957579
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of computer and artificial intelligence technology, Automatic Speech Recognition (ASR) has become a key method for human-computer interaction and is widely used in practical applications such as smart homes, smart wearables, and intelligent dialogue systems. End-to-end speech recognition models are simple in structure, flexible in modeling, and require less memory during decoding; in many application scenarios they achieve better recognition results than traditional hybrid models, and they have become a popular research direction in ASR. In recent years, the Transformer model, which relies solely on self-attention to model sequences, has shown strong sequence-modeling capability and achieved good results in many natural language processing tasks. Building on research into end-to-end speech recognition based on the self-attention mechanism, this thesis proposes an improved end-to-end model that addresses two difficulties of self-attention in speech modeling: learning the positional relationships and alignments of speech sequences, and the dilution of attention weights. The main research contents are summarized as follows:

(1) A SAC model combining self-attention with a Convolutional Neural Network (CNN) is proposed. To address the inability of the self-attention mechanism to model positional relationships within a sequence, a convolutional neural network replaces the sine-and-cosine positional encoding of the original Transformer so that positional relationships are learned automatically; training and decoding tricks for the SAC model are also studied. Experiments verify that the SAC model achieves better speech recognition performance than the Transformer model.

(2) A CTC/SAC hybrid model based on Connectionist Temporal Classification (CTC) and SAC is proposed. The SAC model has difficulty learning the alignment between the speech feature sequence and the output sequence using the self-attention mechanism alone, whereas CTC models sequence alignment easily through the Markov assumption and the forward-backward algorithm. Multi-task learning is therefore used to construct a CTC/SAC hybrid model that combines the modeling advantages of both, and joint training and decoding of the hybrid model are realized. Test results show that the CTC/SAC model improves convergence speed and recognition accuracy over the SAC model.

(3) Optimization of the CTC/SAC hybrid model is studied. An externally trained language model is added to the CTC/SAC hybrid model through shallow fusion to improve its modeling ability. In addition, considering that an output unit in speech recognition depends mainly on a few adjacent speech frames and that the attention computation can be disturbed by noise, self-attention weight bias terms are added to the CTC/SAC hybrid model to suppress attention outside the region of interest, further improving the recognition accuracy of the hybrid model.
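Contribution (1) replaces the Transformer's fixed sinusoidal positional encoding with a convolution over the input features, so that position information is inferred from local context rather than injected by a fixed formula. The sketch below contrasts the two ideas on toy dimensions; the kernel, feature sizes, and the identity-centre initialization are illustrative assumptions, not the thesis's actual architecture.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Fixed sine/cosine positional encoding from the original Transformer."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def conv1d_position(features, kernel, bias=0.0):
    """Learnable alternative: a 1-D convolution over the feature sequence.
    `features` is a list of frames (each a list of d_model values);
    `kernel` is k weight matrices (k x d_model x d_model), zero-padded
    so the output has the same length as the input."""
    k = len(kernel)
    pad = k // 2
    d = len(features[0])
    padded = [[0.0] * d] * pad + features + [[0.0] * d] * pad
    out = []
    for t in range(len(features)):
        frame = [bias] * d
        for j in range(k):
            for o in range(d):
                for i in range(d):
                    frame[o] += kernel[j][o][i] * padded[t + j][i]
        out.append(frame)
    return out

# Toy usage: 5 frames, 4-dim features, kernel size 3.
seq_len, d_model = 5, 4
x = [[float(t)] * d_model for t in range(seq_len)]
ident = [[1.0 if o == i else 0.0 for i in range(d_model)] for o in range(d_model)]
zero = [[0.0] * d_model for _ in range(d_model)]
kernel = [zero, ident, zero]  # centre tap = identity, so output == input here
pe = sinusoidal_encoding(seq_len, d_model)
conv_out = conv1d_position(x, kernel)
```

With an identity centre tap the convolution reproduces its input, which makes the padding and indexing easy to check; in training the kernel weights would instead be learned jointly with the rest of the model.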
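Contribution (2) relies on CTC's ability to model alignment through the Markov assumption and the forward-backward algorithm. As a sketch of that mechanism only (toy vocabulary and hand-made frame posteriors, not the thesis's model), the forward pass below sums the probability of a label sequence over all of its blank-augmented alignments:

```python
def ctc_forward(probs, labels, blank=0):
    """Total probability of `labels` under per-frame distributions `probs`
    (a T x V list of lists), summing over all CTC alignments via dynamic
    programming on the blank-extended label sequence."""
    ext = [blank]
    for l in labels:
        ext += [l, blank]              # e.g. [a, b] -> [_, a, _, b, _]
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s > 0:
                a += alpha[t - 1][s - 1]
            # skip transition allowed for a non-blank with no repeated label
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

# Toy check: 2 frames, vocab {0: blank, 1: 'a'}, target label "a".
# Valid alignments are (a,_), (_,a), (a,a): 0.6*0.5 + 0.4*0.5 + 0.6*0.5 = 0.8
p = [[0.4, 0.6], [0.5, 0.5]]           # p[t][v]
total = ctc_forward(p, [1])
```

In the CTC/SAC hybrid, this CTC objective (in log space, with a backward pass for gradients) is combined with the attention decoder's loss under multi-task learning; a real implementation would also work in log space for numerical stability.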
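Contribution (3) adds an externally trained language model via shallow fusion: at each decoding step the recognizer's score and the LM score are combined log-linearly. A minimal sketch with hand-set toy probabilities; the fusion weight `beta` is a hypothetical value that would be tuned on held-out data in practice.

```python
import math

def shallow_fusion_step(am_logp, lm_logp, beta=0.3):
    """Pick the next token by score(v) = log P_am(v) + beta * log P_lm(v)."""
    scores = {v: am_logp[v] + beta * lm_logp[v] for v in am_logp}
    return max(scores, key=scores.get), scores

# Toy vocab: the recognizer slightly prefers "there",
# but the language model strongly prefers "their".
am = {"their": math.log(0.45), "there": math.log(0.55)}
lm = {"their": math.log(0.9), "there": math.log(0.1)}
best, scores = shallow_fusion_step(am, lm, beta=0.5)
```

Here the LM evidence overturns the recognizer's choice, which is exactly the effect shallow fusion is meant to provide; in beam-search decoding the same combined score would rank every hypothesis extension.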
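The same contribution biases the self-attention weights so that attention outside a local window around the current frame is suppressed, reflecting the observation that an output unit mainly depends on a few adjacent speech frames. A sketch with a hypothetical fixed window and additive penalty; the thesis's exact bias form is not reproduced here.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def biased_attention(scores, query_pos, window=2, penalty=5.0):
    """Subtract a bias from raw attention scores for keys outside
    [query_pos - window, query_pos + window], then renormalise."""
    biased = [s - (0.0 if abs(i - query_pos) <= window else penalty)
              for i, s in enumerate(scores)]
    return softmax(biased)

raw = [1.0] * 9                        # uniform raw scores over 9 frames
plain = softmax(raw)                   # spreads weight evenly (1/9 each)
local = biased_attention(raw, query_pos=4, window=2, penalty=5.0)
in_window = sum(local[2:7])            # mass kept near the query frame
```

After the bias, almost all attention mass stays within the window, so noisy frames far from the region of interest contribute little to the weighted sum.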
Keywords/Search Tags:Automatic Speech Recognition, end-to-end, Self-attention mechanism, Connectionist Temporal Classification, multi-task learning