
Research On CTC-based And Attention-based End-to-end Speech Recognition

Posted on: 2020-07-31    Degree: Master    Type: Thesis
Country: China    Candidate: K Wang    Full Text: PDF
GTID: 2428330590972673    Subject: Computer Science and Technology
Abstract/Summary:
With the continuous generation of exponentially growing speech data, the demand for speech recognition in fields such as industry, agriculture, and the military is increasing day by day, and higher requirements are being placed on the accurate and efficient recognition of large-scale speech signals. In recent years, end-to-end speech recognition has become a hot research direction in the field of speech recognition. Compared with the traditional HMM-based hybrid model, end-to-end speech recognition overcomes the problem that the acoustic, pronunciation, and language models of the hybrid system are optimized relatively independently of one another, and instead achieves globally unified optimization. It also requires neither forced alignment of states nor the construction of a pronunciation dictionary, which greatly reduces the complexity of model construction. This thesis focuses on improving the accuracy and training efficiency of end-to-end speech recognition and studies the two main technical approaches in current end-to-end speech recognition: CTC-based and attention-based models. The main work and innovations are as follows:

1. To address the long training period and insufficient model depth caused by recurrent neural networks in CTC-based end-to-end speech recognition, we analyze group residual convolutional networks and sequence-wise batch normalization, and apply group residual convolutional networks to construct a CTC-based speech recognition model, GRCNN-CTC. In group residual convolutional networks, the large receptive field provided by depth and the stable convergence provided by the residual structure can, to some extent, replace recurrent neural networks for the temporal modeling of long-range dependent speech features. Experimental results show that the model greatly shortens the training period while improving recognition accuracy; a minimal sketch of such a model is given below.
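The abstract does not give the exact GRCNN-CTC architecture, so the following Python/PyTorch sketch only illustrates the general idea under assumed settings: a stack of 1-D grouped residual convolution blocks with batch normalization applied across the batch and time axes, projected to label posteriors and trained with CTC loss. All layer sizes (channels, groups, number of blocks, label count) are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupResidualBlock(nn.Module):
    """1-D grouped convolution block with a residual connection over time."""
    def __init__(self, channels, kernel_size=3, groups=4):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2, groups=groups)
        self.bn1 = nn.BatchNorm1d(channels)   # normalizes over batch and time
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2, groups=groups)
        self.bn2 = nn.BatchNorm1d(channels)

    def forward(self, x):                     # x: (batch, channels, time)
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)                  # residual connection

class GRCNNCTC(nn.Module):
    """Stack of grouped residual blocks + linear projection to CTC labels."""
    def __init__(self, feat_dim=80, channels=256, num_blocks=6, num_labels=30):
        super().__init__()
        self.proj_in = nn.Conv1d(feat_dim, channels, kernel_size=1)
        self.blocks = nn.Sequential(*[GroupResidualBlock(channels)
                                      for _ in range(num_blocks)])
        self.proj_out = nn.Linear(channels, num_labels)  # label 0 = CTC blank

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        x = self.proj_in(feats.transpose(1, 2))
        x = self.blocks(x).transpose(1, 2)
        return self.proj_out(x).log_softmax(dim=-1)      # (batch, time, labels)

# One CTC training step; nn.CTCLoss expects log-probs as (time, batch, labels).
model = GRCNNCTC()
feats = torch.randn(8, 200, 80)               # 8 utterances, 200 frames each
targets = torch.randint(1, 30, (8, 20))       # 20 labels per utterance, no blanks
log_probs = model(feats).transpose(0, 1)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((8,), 200, dtype=torch.long),
                           torch.full((8,), 20, dtype=torch.long))
loss.backward()

Because each grouped convolution only sees a fixed local window, it is the depth of the stack (num_blocks here) that enlarges the receptive field enough to stand in for a recurrent layer on long-range temporal context.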
2. To address the inaccurate alignment between decoder and encoder state vectors, the insufficient representation of the decoder's input features, and the poor generalization caused by one-hot encoding, the proposed model uses three techniques to improve the recognition performance and generalization ability of the attention-based end-to-end speech recognition model. The first is to build an end-to-end speech recognition model supported by the multi-head attention mechanism, Multi-Head LAS: by mapping feature vectors into different representation subspaces, the correlation between the current decoder state vector and the encoder state vectors can be computed along multiple dimensions, yielding more accurate alignment information. The second is to improve the decoder input stream with the input-feeding method, which replaces the context vector of the previous time step with the hidden state vector produced by the multi-layer perceptron at the previous time step, strengthening the representation of the decoder's input features. The third is to use label smoothing regularization, which introduces label noise to constrain the model and reduce over-fitting. Experimental results show that the attention-based end-to-end model improved with these three techniques effectively increases the recognition performance and generalization ability of the model. Sketches of a decoder step with multi-head attention and input-feeding, and of label smoothing, follow below.
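As an illustration of the second contribution, the following sketch combines a multi-head attention query from the decoder state over the encoder states with input-feeding, where the previous attentional hidden vector is concatenated with the embedded previous label to form the recurrent input. The dimensions, layer names, and the use of nn.MultiheadAttention are assumptions made for illustration; they are not taken from the thesis.

import torch
import torch.nn as nn

class MultiHeadLASDecoderStep(nn.Module):
    def __init__(self, vocab_size=30, embed_dim=256, enc_dim=256,
                 dec_dim=256, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Input-feeding: the previous attentional hidden vector is concatenated
        # with the embedded previous label as the recurrent input.
        self.rnn = nn.LSTMCell(embed_dim + dec_dim, dec_dim)
        # Multi-head attention: the decoder state queries the encoder states
        # in several representation subspaces.
        self.attn = nn.MultiheadAttention(dec_dim, num_heads, batch_first=True)
        # Small perceptron fusing the decoder state and the context vector into
        # the attentional hidden vector that is fed back at the next step.
        self.combine = nn.Linear(dec_dim + enc_dim, dec_dim)
        self.out = nn.Linear(dec_dim, vocab_size)

    def step(self, prev_label, prev_attn_hidden, state, enc_states):
        # prev_label: (batch,)   prev_attn_hidden: (batch, dec_dim)
        # state: LSTM (h, c)     enc_states: (batch, enc_time, enc_dim)
        rnn_in = torch.cat([self.embed(prev_label), prev_attn_hidden], dim=-1)
        h, c = self.rnn(rnn_in, state)
        context, _ = self.attn(h.unsqueeze(1), enc_states, enc_states)
        attn_hidden = torch.tanh(
            self.combine(torch.cat([h, context.squeeze(1)], dim=-1)))
        return self.out(attn_hidden), attn_hidden, (h, c)

# One decoding step over dummy encoder outputs.
decoder = MultiHeadLASDecoderStep()
enc_states = torch.randn(8, 120, 256)
prev_label = torch.zeros(8, dtype=torch.long)          # assumed <sos> id 0
prev_attn_hidden = torch.zeros(8, 256)
state = (torch.zeros(8, 256), torch.zeros(8, 256))
logits, prev_attn_hidden, state = decoder.step(prev_label, prev_attn_hidden,
                                               state, enc_states)

Feeding attn_hidden (rather than the raw context vector) back into the next step is what gives the decoder an explicit memory of its previous alignment decision.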
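Finally, a minimal sketch of label smoothing regularization as described in the third technique: the one-hot target is replaced by a softened distribution before the cross-entropy is computed. The smoothing weight of 0.1 is illustrative; the abstract does not state the value used.

import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, smoothing=0.1):
    """Cross entropy against a softened target distribution instead of one-hot."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Soft targets: (1 - smoothing) on the true label, the rest spread uniformly.
    soft = torch.full_like(log_probs, smoothing / (num_classes - 1))
    soft.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)
    return -(soft * log_probs).sum(dim=-1).mean()

logits = torch.randn(8, 30)                   # batch of 8, 30 output labels
targets = torch.randint(0, 30, (8,))
loss = label_smoothing_loss(logits, targets)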
Keywords/Search Tags:speech recognition, deep learning, connectionist temporal classification, attention mechanism, group residual convolutional networks