
Exploring End-to-end Speech Recognition Models

Posted on: 2021-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: K P Yuan
GTID: 2518306308979499
Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years, deep learning has become increasingly popular in many fields, and speech recognition built on deep learning has matured along with it. The hybrid method that combines a hidden Markov model with a deep neural network (HMM-DNN) has raised the accuracy of traditional speech recognition systems to a new level. Although its recognition accuracy is good, the hybrid system still faces difficulties, such as the large computational space required for decoding and a training process that is not concise enough. End-to-end speech recognition methods emerged to address these difficulties, and among them the models based on the attention mechanism work best. The mainstream models that currently perform well include Listen, Attend and Spell (LAS), the Transformer, and the deep feedforward sequential memory network (DFSMN). Although current end-to-end methods work well, some problems remain to be studied. This thesis studies modeling for end-to-end speech recognition. The specific content is as follows:

1. The LAS-based end-to-end speech recognition model and methods for improving it are studied. Building on the LAS model, a corrective training method based on the confusion matrix is proposed, which improves both the recognition accuracy and the robustness of the model. The thesis also compares input features for LAS and finds that the full Mel-spectrum feature performs best. On this basis, model training is further optimized: two training criteria, minimum word error rate and maximum mutual information, are implemented, and pruned-search decoding is used to optimize network training, which improves recognition accuracy.

2. The effect of the encoder in end-to-end models is studied and improved. According to the analysis in this thesis, the LAS framework is considered the closest in principle to the human speech recognition system. LAS is therefore chosen as the basic framework, and the impact of three different encoders (LAS, Transformer, DFSMN) on recognition results is explored by replacing the encoder. The structure of each encoder is then improved and optimized: a ConvLSTM structure replaces the pyramid-shaped LSTM structure of LAS and improves its results; a ConvTransformer encoder is proposed by combining CNN and Transformer encoders, and the influence of the depth of each network module on recognition accuracy is explored; and a convolution module is used to optimize the DFSMN. In the end, all three encoders exceed the accuracy of the original LAS baseline system. This further demonstrates that the convolutional invariance of CNNs is very effective for the acoustic processing in speech recognition.

To sum up, this thesis first explores the LAS framework itself, proposes optimizations and improvements, and adopts several training techniques to improve LAS performance. It then explores the encoders of three mainstream end-to-end speech recognition frameworks (LAS, Transformer, DFSMN) and proposes network-structure improvements so that all three encoders outperform the LAS baseline system. This work has achieved good results.
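As a rough illustration of the pyramid-shaped LSTM structure of LAS mentioned in the abstract, the sketch below shows only its time-reduction step: adjacent frames are concatenated so that each layer halves the sequence length. It is a minimal sketch, not the thesis implementation. The recurrent (LSTM) computation between reductions is omitted, and the names `pyramid_reduce` and `pyramid_stack`, the choice of three layers, and the handling of an odd trailing frame are all assumptions made for illustration.

```python
from typing import List

Frame = List[float]


def pyramid_reduce(frames: List[Frame]) -> List[Frame]:
    """Concatenate each pair of adjacent frames, halving the sequence length.

    Assumption of this sketch: a trailing unpaired frame is dropped.
    """
    usable = len(frames) - len(frames) % 2
    return [frames[t] + frames[t + 1] for t in range(0, usable, 2)]


def pyramid_stack(frames: List[Frame], num_layers: int = 3) -> List[Frame]:
    """Apply the reduction num_layers times (hypothetical default of 3).

    In a real pyramidal encoder a recurrent layer would run between
    reductions; it is left out here to isolate the time reduction.
    """
    for _ in range(num_layers):
        frames = pyramid_reduce(frames)
    return frames


if __name__ == "__main__":
    # 17 one-dimensional toy frames: 17 -> 8 -> 4 -> 2 time steps.
    toy = [[float(t)] for t in range(17)]
    reduced = pyramid_stack(toy, num_layers=3)
    print(len(reduced), len(reduced[0]))
```

With three reductions the encoder output is roughly eight times shorter than the input, which shortens the sequence the attention mechanism has to attend over.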
Keywords/Search Tags: end-to-end, LAS, Transformer, DFSMN, corrective training, confusion matrix