
Research On End-to-end Speech Recognition Based On Deep Learning

Posted on: 2020-02-09 | Degree: Master | Type: Thesis
Country: China | Candidate: Y L Li | Full Text: PDF
GTID: 2438330626953258 | Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
In recent years, end-to-end models based on deep learning have been widely adopted in speech recognition. In these models, the mapping between the acoustic feature sequence and the output graphemes is learned by the model itself, without any manual forced alignment. Compared with traditional hybrid models, end-to-end models therefore leave more room for data-driven adjustment and are more self-contained overall. However, many experiments show that end-to-end models usually require massive amounts of training data before they reach the same recognition accuracy as hybrid models. The fundamental reason is that the structure and design of current end-to-end models still leave much to be improved. End-to-end models mainly include the attention-based encoder-decoder model and the CTC model. This paper analyzes both and proposes improvements to address problems in the current attention mechanism and CTC model. The specific contributions are as follows:

1. To address the problem that the hybrid attention mechanism based on convolutional location information cannot fully account for location information at multiple past time steps, this paper proposes a hybrid attention mechanism that uses LSTM units. First, convolution kernels extract multi-channel feature maps from the current attention score distribution; then, global average pooling aggregates each channel's feature map into a fixed-dimensional vector; finally, this vector is fed as the current-step input to an LSTM, whose output serves as the location vector for generating the attention scores at the next step. The classic LAS model is used to evaluate the new attention mechanism. Experimental results show that the improved model achieves the lowest label error rate on both clean and noisy speech test sets, down by 1.8% and 2.21% respectively compared with the LAS model based on convolutional location information.

2. By stacking multiple recurrent layers, a CTC model can achieve better recognition accuracy. However, a deep multi-layer recurrent network suffers from a serious problem: vanishing gradients. To tackle this, this paper proposes a deep acoustic model built on a densely connected recurrent neural network. The model improves the structure of the classic Deep Speech 2 model, introducing dense connections between recurrent layers so that features and gradients propagate more efficiently. Experimental results show that the improved model achieves the lowest label error rate on a medium-sized Chinese speech dataset, down by 5.21% and 3.68% on the training set and test set respectively, compared with Deep Speech 2.
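The first contribution's pipeline (convolution over the previous attention distribution, global average pooling per channel, an LSTM step producing the location vector for the next attention score) can be sketched as follows. This is a hypothetical PyTorch reconstruction from the abstract's description, not the thesis's actual code; all layer sizes, names, and the additive-attention scoring form are assumptions.

```python
import torch
import torch.nn as nn

class LSTMLocationAttention(nn.Module):
    """Sketch of the proposed hybrid attention with LSTM-tracked location:
    conv over previous attention weights -> global average pooling ->
    LSTMCell -> location vector mixed into an additive attention score.
    All dimensions and names are illustrative assumptions."""

    def __init__(self, enc_dim, dec_dim, attn_dim, channels=10, ksize=15):
        super().__init__()
        self.conv = nn.Conv1d(1, channels, ksize, padding=ksize // 2)
        self.loc_rnn = nn.LSTMCell(channels, attn_dim)
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_loc = nn.Linear(attn_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_out, dec_state, prev_attn, loc_state=None):
        # enc_out: [B, T, enc_dim]; dec_state: [B, dec_dim]; prev_attn: [B, T]
        feats = self.conv(prev_attn.unsqueeze(1))       # multi-channel maps [B, C, T]
        pooled = feats.mean(dim=2)                      # global average pooling -> [B, C]
        h, c = self.loc_rnn(pooled, loc_state)          # location vector h: [B, attn_dim]
        score = self.v(torch.tanh(
            self.W_enc(enc_out)
            + self.W_dec(dec_state).unsqueeze(1)
            + self.W_loc(h).unsqueeze(1))).squeeze(-1)  # next-step scores [B, T]
        attn = torch.softmax(score, dim=1)
        context = (attn.unsqueeze(1) @ enc_out).squeeze(1)  # context vector [B, enc_dim]
        return context, attn, (h, c)
```

Carrying `(h, c)` across decoding steps is what lets the mechanism summarize location information from multiple past time steps, rather than only the single previous attention map that a plain convolutional location feature sees.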
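The second contribution's dense connectivity between recurrent layers could look like the minimal sketch below, assuming a DenseNet-style pattern in which each recurrent layer receives the concatenation of the network input and all earlier layers' outputs. The abstract does not specify cell type or widths; GRU layers and the dimensions here are illustrative assumptions, not the thesis's configuration.

```python
import torch
import torch.nn as nn

class DenseRNN(nn.Module):
    """Sketch of a densely connected recurrent stack: layer i consumes the
    concatenation of the input and all previous layers' outputs, giving
    features and gradients short paths through a deep stack."""

    def __init__(self, in_dim, hidden, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = in_dim
        for _ in range(num_layers):
            self.layers.append(nn.GRU(dim, hidden, batch_first=True))
            dim += hidden  # dense connections widen each layer's input
        self.out_dim = dim

    def forward(self, x):
        # x: [B, T, in_dim]
        feats = [x]
        for rnn in self.layers:
            y, _ = rnn(torch.cat(feats, dim=-1))
            feats.append(y)
        # every layer's features remain visible to the output classifier
        return torch.cat(feats, dim=-1)
```

Because each layer's output also feeds the final concatenation directly, gradients from the CTC loss reach early layers without passing through every intermediate recurrent transformation, which is the mechanism the abstract credits for mitigating vanishing gradients.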
Keywords/Search Tags:Attention mechanism, CTC model, LSTM, LAS model, Deep Speech 2 model, Densely connected recurrent neural network