
Exploring End-to-end Speech Recognition Models

Posted on: 2020-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: R C Fan
Full Text: PDF
GTID: 2428330575456408
Subject: Information and Communication Engineering
Abstract/Summary:
With the increase of computing power and the availability of big data, deep learning has become one of the most popular methods in the field of speech recognition. Under its influence, the performance of hybrid speech recognition systems has improved greatly, mainly because deep neural networks are powerful models of the acoustic-state posterior probabilities. However, hybrid systems still suffer from a complex training process and a large decoding space. End-to-end models were proposed to simplify the whole speech recognition pipeline. They mainly fall into three categories: Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-T), and attention-based end-to-end models (A-E2E). Although end-to-end models simplify the recognition process and are already comparable to hybrid systems on certain tasks, many problems remain. This thesis explores the CTC model and the problems in attention-based models, and gives corresponding solutions as follows.

Firstly, we explore the CTC model and propose a Full-Mel spectrum feature that accords with human auditory characteristics. We combine this feature with a convolutional neural network (CNN) as the front-end processing network for speech, and we also study the role of a shallow CNN in the CTC model and the design principles for its parameters.

Secondly, we implement the framework of listener, attender, and speller (LAS), an attention-based model. The thesis details the tricks we use in the LAS model, in both training and decoding, to improve performance. In addition, we propose a new way to add a word-level language model during beam-search decoding.

Next, we explore discriminative training of LAS. Drawing on traditional speech recognition systems, we implement discriminative training based on the Minimum Word Error Rate (MWER) criterion and propose a Maximum Mutual Information (MMI) criterion for LAS. Experiments show that both the MMI and the MWER criterion improve LAS performance over the cross-entropy criterion, which suffers from a mismatch between the training loss and the evaluation metric.

Finally, we make an in-depth study of the streaming LAS model. On the encoder side, we propose a latency-controlled structure to reduce encoder latency. Meanwhile, we propose an Adaptive Monotonic Chunk-wise Attention (AMoChA) mechanism to make the attention component streamable. On a 1k-hour Mandarin dictation corpus, we reduce the latency of LAS with an acceptable relative 3.5% degradation in character error rate (CER) compared to the offline LAS model.

In summary, this thesis explores two end-to-end speech recognition models, CTC and LAS. We investigate the problems in LAS, such as external language-model integration, discriminative training, and high latency, and achieve good results with our proposed methods.
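To make the CNN-plus-CTC front-end concrete, the sketch below pairs a shallow CNN with a BLSTM trained under the CTC loss in PyTorch. This is a minimal illustration only: the feature dimension, layer sizes, and vocabulary size are assumptions for the example, not the settings or the Full-Mel feature used in the thesis.

    import torch
    import torch.nn as nn

    class ShallowCNNCTC(nn.Module):
        """A shallow CNN front-end followed by a BLSTM, trained with CTC."""
        def __init__(self, feat_dim=80, hidden=320, vocab=4000):
            super().__init__()
            # Two conv layers; each halves time and frequency (4x overall).
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            )
            self.rnn = nn.LSTM(32 * (feat_dim // 4), hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, vocab + 1)  # +1 for the CTC blank

        def forward(self, x):                     # x: (batch, time, feat_dim)
            h = self.cnn(x.unsqueeze(1))          # (batch, 32, time/4, feat/4)
            h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time/4, 32*feat/4)
            h, _ = self.rnn(h)
            return self.out(h).log_softmax(-1)

    model = ShallowCNNCTC()
    x = torch.randn(4, 200, 80)              # a dummy batch of features
    log_probs = model(x).transpose(0, 1)     # nn.CTCLoss expects (T, N, C)
    targets = torch.randint(1, 4000, (4, 20))
    loss = nn.CTCLoss(blank=0, zero_infinity=True)(
        log_probs, targets,
        input_lengths=torch.full((4,), 50),  # 200 frames / 4 after the CNN
        target_lengths=torch.full((4,), 20))
    loss.backward()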
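For language-model integration during beam search, the following sketch shows the standard shallow-fusion interpolation, where the joint score of a hypothesis is log P_LAS + lam * log P_LM. The callables las_step and lm_step are hypothetical stand-ins for the decoder and language model, and the thesis's word-level scheme may apply LM scores differently.

    def beam_search(las_step, lm_step, sos, eos, beam=8, lam=0.3, max_len=100):
        """Return the best label sequence under joint LAS + LM scoring."""
        beams = [([sos], 0.0)]                   # (prefix, joint log-score)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                las_logp = las_step(prefix)      # dict: token -> log P_las
                lm_logp = lm_step(prefix)        # dict: token -> log P_lm
                for tok, lp in las_logp.items():
                    joint = score + lp + lam * lm_logp.get(tok, -1e9)
                    candidates.append((prefix + [tok], joint))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for prefix, score in candidates[:beam]:
                # Hypotheses that emit the end-of-sequence token are frozen.
                (finished if prefix[-1] == eos else beams).append((prefix, score))
            if not beams:
                break
        pool = finished or beams
        return max(pool, key=lambda c: c[1])[0]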
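The MWER criterion minimizes the expected number of word errors over an N-best list produced by beam search. A minimal sketch, assuming hypothesis log-probabilities come from the LAS decoder and edit-distance error counts are precomputed; this illustrates the general criterion, not the thesis's exact implementation:

    import torch

    def mwer_loss(hyp_logprobs, word_errors):
        """hyp_logprobs: (batch, nbest) sequence log-probabilities;
           word_errors:  (batch, nbest) word-error counts per hypothesis."""
        # Renormalize over the N-best list so probabilities sum to one.
        probs = torch.softmax(hyp_logprobs, dim=-1)
        # Subtracting the mean error acts as a variance-reducing baseline.
        relative = word_errors - word_errors.mean(dim=-1, keepdim=True)
        return (probs * relative).sum(-1).mean()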
Keywords/Search Tags: end-to-end, CTC, LAS, discriminative training, online speech recognition