
Research On Speech Recognition Based On Transformer

Posted on: 2021-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Zhou
Full Text: PDF
GTID: 2428330611467483
Subject: Control engineering
Abstract/Summary:
Ever since people have been able to build and use machines, there has been the grand idea of letting machines "hear" and understand human language, and of having machines act on spoken commands, so as to realize true human-machine interaction. In recent years, with the continuous development of big data, neural networks, and deep learning, speech recognition technology built on these foundations has become increasingly mature, bringing this vision within sight of realization. Speech recognition is essentially a technology that allows machines to convert voice information into corresponding text or commands through recognition and understanding; its advantage is that it greatly improves the efficiency of people's daily work.

At present, there are two main frameworks for speech recognition, the RNN (Recurrent Neural Network) and the Transformer, which are applied to ASR (Automatic Speech Recognition), ST (Speech Translation), TTS (Text-To-Speech), and other tasks. However, these models have several shortcomings:
1. On multiple GPUs, the RNN model shows no obvious performance improvement.
2. In a variety of speech environments, the RNN model suffers from low character accuracy and low recognition speed.
3. The L1 loss of the RNN model is seriously too large (the L1 loss, also called the mean absolute error, is the average of the absolute prediction errors; an excessively large value can lead to gradient explosion).
4. The Transformer model, for its part, suffers from under-fitting (the model fits the data poorly, with the data lying far from the fitted curve) and from excessively low decoding-filter efficiency.

In response to the above problems, the research work of this thesis is as follows:
(1) For the problem that the RNN model shows no obvious performance improvement on multiple GPUs, this thesis introduces the Transformer speech recognition model and first sets up a control group: the character accuracy of the Transformer and RNN models is verified under 1, 2, and 4 GPUs respectively. The conclusion is that under multiple GPUs the Transformer performs better than the RNN.
(2) For the problems of low character accuracy and low recognition speed of the RNN model in multiple speech environments, this thesis sets up 15 data sets; the RNN model adopts the Adadelta algorithm, the Transformer model adopts its default configuration, and ASR experiments are then carried out. The experimental results show that on 13 of the 15 corpora the Transformer's recognition performance is better than the RNN's.
(3) For the problem that the L1 loss of the RNN model is too large, this thesis sets up two corpora and runs TTS experiments on a single GPU, recording the L1 loss. The experimental results show that the Transformer achieves a better L1 loss than the RNN on both large-batch and small-batch data sets, and that the number of GPUs also affects the Transformer's L1 loss (a minimal sketch of the L1 loss is given after the abstract).
(4) For the Transformer's problems of under-fitting and low decoding-filter efficiency, this thesis adds small-batch data sets to prevent under-fitting and shorten the training time; at the same time, it uses the FastSpeech system to run the Transformer's TTS experiments, which greatly improves the Transformer's decoding-filter efficiency and makes it better than the RNN's.

The innovations of this thesis lie in saving experimental cost: when a multi-GPU experimental environment is not available, an accumulated-gradient strategy is used to simulate large-batch and small-batch data sets for the experiments; to obtain more accurate experimental results, data augmentation is used to optimize them; and a reduction coefficient is introduced into the Transformer model, which greatly reduces the training time (sketches of the accumulated-gradient strategy and the reduction coefficient are also given after the abstract).
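As a concrete illustration of the L1 loss compared in research point (3), the following is a minimal sketch in PyTorch (the framework choice and the tensor shapes are assumptions for illustration; the thesis does not specify its implementation):

import torch

# L1 loss (mean absolute error): the average of |prediction - target|.
# In TTS this is typically computed between predicted and ground-truth
# spectrogram frames.
predicted = torch.randn(4, 80)   # e.g. 4 frames of an 80-band mel-spectrogram
target = torch.randn(4, 80)

l1 = torch.mean(torch.abs(predicted - target))
# Equivalently, via the built-in module:
l1_builtin = torch.nn.L1Loss()(predicted, target)
assert torch.allclose(l1, l1_builtin)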
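The accumulated-gradient strategy named among the innovations can be sketched as follows. This is a minimal, hypothetical PyTorch-style illustration (the thesis does not publish its training code): gradients from several small batches are summed before a single optimizer update, so one GPU simulates the effective batch size of a multi-GPU run.

import torch

def train_with_gradient_accumulation(model, loader, optimizer, loss_fn,
                                     accum_steps=4):
    """One training epoch in which accum_steps small batches are
    accumulated into one effective large batch before each update."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets)
        # Scale so that the summed gradient matches the gradient of one
        # large batch of size accum_steps * batch_size.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()      # one update per accumulated "large batch"
            optimizer.zero_grad()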
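The "reduction coefficient" is not defined in the abstract; a plausible reading, based on common practice in sequence-to-sequence TTS (e.g. Transformer-TTS), is the reduction factor r, whereby the decoder emits r spectrogram frames per step instead of one, shortening the decoded sequence and hence the training time. A minimal sketch of that idea, under this assumption:

import torch

def group_frames(mel, r):
    """Group a mel-spectrogram of shape (T, n_mels) into (T // r, r * n_mels)
    so a decoder predicts r frames per step (reduction factor r)."""
    T, n_mels = mel.shape
    T_trim = (T // r) * r            # drop trailing frames that do not fill a group
    return mel[:T_trim].reshape(T // r, r * n_mels)

mel = torch.randn(100, 80)           # illustrative: 100 frames, 80 mel bands
grouped = group_frames(mel, r=5)     # decoder sequence length shrinks 100 -> 20
print(grouped.shape)                 # torch.Size([20, 400])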
Keywords/Search Tags: automatic speech recognition, recurrent neural network, Transformer, speech translation, text-to-speech