
Exploring End-to-end Speech Recognition Models

Posted on: 2020-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: R C Fan
Full Text: PDF
GTID: 2428330575456408
Subject: Information and Communication Engineering
Abstract/Summary:
With the increase of computing power and the availability of big data, deep learning has become one of the most popular methods in the field of speech recognition. Under its influence, the performance of hybrid speech recognition systems has improved greatly, mainly because deep neural networks are powerful models of the acoustic-state posterior probabilities. However, hybrid systems still suffer from a complex training process and a large decoding space. End-to-end models were proposed to simplify the whole speech recognition pipeline. They mainly fall into three categories: Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-T), and attention-based end-to-end models (A-E2E). Although end-to-end models simplify the recognition process and are already comparable to hybrid systems on certain tasks, many problems remain. This thesis explores the CTC model and the problems in attention-based models, and gives corresponding solutions as follows.

Firstly, we explore the CTC model and propose a Full-Mel spectrum feature that accords with human auditory characteristics. We combine this feature with a convolutional neural network (CNN) as the front-end processing network for speech, and we also study the role of a shallow CNN in the CTC model and the design principles for its parameters.

Secondly, we implement the framework of listener, attender, and speller (LAS), an attention-based model. The thesis details the tricks we use in the LAS model, in both training and decoding, to improve performance. In addition, we propose a new way to add a word-level language model during beam-search decoding.

Next, we explore discriminative training of LAS. Drawing on traditional speech recognition systems, we implement discriminative training based on the Minimum Word Error Rate (MWER) criterion and propose a Maximum Mutual Information (MMI) criterion for LAS. Experiments show that both the MMI and the MWER criterion improve LAS performance over the cross-entropy criterion, which suffers from a mismatch between the training loss and the evaluation metric.

Finally, we make an in-depth study of the streaming LAS model. On the encoder side, we propose a latency-controlled structure to reduce encoder latency. Meanwhile, we propose an Adaptive Monotonic Chunk-wise Attention (AMoChA) mechanism to make the attention component streamable. On a 1k-hour Mandarin dictation corpus, we reduce the latency of LAS with an acceptable relative 3.5% degradation in character error rate (CER) compared to the offline LAS model.

In summary, this thesis explores two end-to-end speech recognition models, CTC and LAS. We investigate the problems in LAS, such as external language-model integration, discriminative training, and high latency, and achieve good results with our proposed methods.
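To make the CNN-plus-CTC front-end concrete, the sketch below pairs a shallow CNN with a BLSTM trained under the CTC loss in PyTorch. This is a minimal illustration only: the feature dimension, layer sizes, and vocabulary size are assumptions for the example, not the settings or the Full-Mel feature used in the thesis.

    import torch
    import torch.nn as nn

    class ShallowCNNCTC(nn.Module):
        """A shallow CNN front-end followed by a BLSTM, trained with CTC."""
        def __init__(self, feat_dim=80, hidden=320, vocab=4000):
            super().__init__()
            # Two conv layers; each halves time and frequency (4x overall).
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            )
            self.rnn = nn.LSTM(32 * (feat_dim // 4), hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, vocab + 1)  # +1 for the CTC blank

        def forward(self, x):                     # x: (batch, time, feat_dim)
            h = self.cnn(x.unsqueeze(1))          # (batch, 32, time/4, feat/4)
            h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time/4, 32*feat/4)
            h, _ = self.rnn(h)
            return self.out(h).log_softmax(-1)

    model = ShallowCNNCTC()
    x = torch.randn(4, 200, 80)              # a dummy batch of features
    log_probs = model(x).transpose(0, 1)     # nn.CTCLoss expects (T, N, C)
    targets = torch.randint(1, 4000, (4, 20))
    loss = nn.CTCLoss(blank=0, zero_infinity=True)(
        log_probs, targets,
        input_lengths=torch.full((4,), 50),  # 200 frames / 4 after the CNN
        target_lengths=torch.full((4,), 20))
    loss.backward()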
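For language-model integration during beam search, the following sketch shows the standard shallow-fusion interpolation, where the joint score of a hypothesis is log P_LAS + lam * log P_LM. The callables las_step and lm_step are hypothetical stand-ins for the decoder and language model, and the thesis's word-level scheme may apply LM scores differently.

    def beam_search(las_step, lm_step, sos, eos, beam=8, lam=0.3, max_len=100):
        """Return the best label sequence under joint LAS + LM scoring."""
        beams = [([sos], 0.0)]                   # (prefix, joint log-score)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                las_logp = las_step(prefix)      # dict: token -> log P_las
                lm_logp = lm_step(prefix)        # dict: token -> log P_lm
                for tok, lp in las_logp.items():
                    joint = score + lp + lam * lm_logp.get(tok, -1e9)
                    candidates.append((prefix + [tok], joint))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for prefix, score in candidates[:beam]:
                # Hypotheses that emit the end-of-sequence token are frozen.
                (finished if prefix[-1] == eos else beams).append((prefix, score))
            if not beams:
                break
        pool = finished or beams
        return max(pool, key=lambda c: c[1])[0]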
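The MWER criterion minimizes the expected number of word errors over an N-best list produced by beam search. A minimal sketch, assuming hypothesis log-probabilities come from the LAS decoder and edit-distance error counts are precomputed; this illustrates the general criterion, not the thesis's exact implementation:

    import torch

    def mwer_loss(hyp_logprobs, word_errors):
        """hyp_logprobs: (batch, nbest) sequence log-probabilities;
           word_errors:  (batch, nbest) word-error counts per hypothesis."""
        # Renormalize over the N-best list so probabilities sum to one.
        probs = torch.softmax(hyp_logprobs, dim=-1)
        # Subtracting the mean error acts as a variance-reducing baseline.
        relative = word_errors - word_errors.mean(dim=-1, keepdim=True)
        return (probs * relative).sum(-1).mean()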
Keywords/Search Tags: end-to-end, CTC, LAS, discriminative training, online speech recognition