
Research On End-to-end Speech Recognition Based On Deep Learning

Posted on: 2020-02-09 | Degree: Master | Type: Thesis
Country: China | Candidate: Y L Li | Full Text: PDF
GTID: 2438330626953258 | Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
In recent years, end-to-end models based on deep learning have been widely adopted in speech recognition. In these models, the mapping between the acoustic feature sequence and the output graphemes is learned by the model itself, without any manual forced alignment. Compared with traditional hybrid models, end-to-end models therefore leave more room for data-driven adjustment and are more self-contained overall. However, many experiments show that end-to-end models usually require massive amounts of training data before they reach the same recognition accuracy as hybrid models. The fundamental reason is that the structure and design of current end-to-end models still leave much to be improved. End-to-end models mainly include the attention-based encoder-decoder model and the CTC model. This paper analyzes both and proposes improvements to address problems in the current attention mechanism and CTC model. The specific contributions are as follows:

1. To address the problem that the hybrid attention mechanism based on convolutional location information cannot fully account for location information at multiple past time steps, this paper proposes a hybrid attention mechanism that uses LSTM units. First, convolution kernels extract multi-channel feature maps from the current attention score distribution; then, global average pooling aggregates each channel's feature map into a fixed-dimensional vector; finally, this vector is fed as the current-step input to an LSTM, whose output serves as the location vector for generating the attention scores at the next step. The classic LAS model is used to evaluate the new attention mechanism. Experimental results show that the improved model achieves the lowest label error rate on both clean and noisy speech test sets, down by 1.8% and 2.21% respectively compared with the LAS model based on convolutional location information.

2. By stacking multiple recurrent layers, a CTC model can achieve better recognition accuracy. However, a deep multi-layer recurrent network suffers from a serious problem: vanishing gradients. To tackle this, this paper proposes a deep acoustic model built on a densely connected recurrent neural network. The model improves the structure of the classic Deep Speech 2 model, introducing dense connections between recurrent layers so that features and gradients propagate more efficiently. Experimental results show that the improved model achieves the lowest label error rate on a medium-sized Chinese speech dataset, down by 5.21% and 3.68% on the training set and test set respectively, compared with Deep Speech 2.
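The first contribution's pipeline (convolution over the previous attention distribution, global average pooling per channel, an LSTM step producing the location vector for the next attention score) can be sketched as follows. This is a hypothetical PyTorch reconstruction from the abstract's description, not the thesis's actual code; all layer sizes, names, and the additive-attention scoring form are assumptions.

```python
import torch
import torch.nn as nn

class LSTMLocationAttention(nn.Module):
    """Sketch of the proposed hybrid attention with LSTM-tracked location:
    conv over previous attention weights -> global average pooling ->
    LSTMCell -> location vector mixed into an additive attention score.
    All dimensions and names are illustrative assumptions."""

    def __init__(self, enc_dim, dec_dim, attn_dim, channels=10, ksize=15):
        super().__init__()
        self.conv = nn.Conv1d(1, channels, ksize, padding=ksize // 2)
        self.loc_rnn = nn.LSTMCell(channels, attn_dim)
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_loc = nn.Linear(attn_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_out, dec_state, prev_attn, loc_state=None):
        # enc_out: [B, T, enc_dim]; dec_state: [B, dec_dim]; prev_attn: [B, T]
        feats = self.conv(prev_attn.unsqueeze(1))       # multi-channel maps [B, C, T]
        pooled = feats.mean(dim=2)                      # global average pooling -> [B, C]
        h, c = self.loc_rnn(pooled, loc_state)          # location vector h: [B, attn_dim]
        score = self.v(torch.tanh(
            self.W_enc(enc_out)
            + self.W_dec(dec_state).unsqueeze(1)
            + self.W_loc(h).unsqueeze(1))).squeeze(-1)  # next-step scores [B, T]
        attn = torch.softmax(score, dim=1)
        context = (attn.unsqueeze(1) @ enc_out).squeeze(1)  # context vector [B, enc_dim]
        return context, attn, (h, c)
```

Carrying `(h, c)` across decoding steps is what lets the mechanism summarize location information from multiple past time steps, rather than only the single previous attention map that a plain convolutional location feature sees.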
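The second contribution's dense connectivity between recurrent layers could look like the minimal sketch below, assuming a DenseNet-style pattern in which each recurrent layer receives the concatenation of the network input and all earlier layers' outputs. The abstract does not specify cell type or widths; GRU layers and the dimensions here are illustrative assumptions, not the thesis's configuration.

```python
import torch
import torch.nn as nn

class DenseRNN(nn.Module):
    """Sketch of a densely connected recurrent stack: layer i consumes the
    concatenation of the input and all previous layers' outputs, giving
    features and gradients short paths through a deep stack."""

    def __init__(self, in_dim, hidden, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = in_dim
        for _ in range(num_layers):
            self.layers.append(nn.GRU(dim, hidden, batch_first=True))
            dim += hidden  # dense connections widen each layer's input
        self.out_dim = dim

    def forward(self, x):
        # x: [B, T, in_dim]
        feats = [x]
        for rnn in self.layers:
            y, _ = rnn(torch.cat(feats, dim=-1))
            feats.append(y)
        # every layer's features remain visible to the output classifier
        return torch.cat(feats, dim=-1)
```

Because each layer's output also feeds the final concatenation directly, gradients from the CTC loss reach early layers without passing through every intermediate recurrent transformation, which is the mechanism the abstract credits for mitigating vanishing gradients.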
Keywords/Search Tags:Attention mechanism, CTC model, LSTM, LAS model, Deep Speech 2 model, Densely connected recurrent neural network