
Application Research of Attention-Based End-to-End Speech Recognition

Posted on: 2021-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: B J Liu
Full Text: PDF
GTID: 2428330611466448
Subject: Signal and Information Processing
Abstract/Summary:
With the improvement of computing performance and the development of big data, the application of deep learning has greatly reduced the error rate of speech recognition systems, making systems based on the Hidden Markov Model-Deep Neural Network (HMM-DNN) framework the mainstream. In recent years, end-to-end speech recognition methods have attracted extensive attention. Unlike the traditional approach, which requires a series of training steps to obtain frame-level alignment labels, end-to-end methods directly learn the mapping between speech features and text, simplifying the training process of the speech recognition model. End-to-end methods can be roughly divided into two categories: the Connectionist Temporal Classification (CTC) method, which directly trains frame-level alignments, and attention-based methods, which model the correspondence between the feature sequence and the text sequence. This work focuses on end-to-end speech recognition models based on the attention mechanism. The main research results are as follows:

1. End-to-end speech recognition models tend to overfit when training data is limited because they typically have a large number of parameters; as a result, they perform worse than traditional hybrid models. Using the 8-hour small-scale English dataset TIMIT, this research proposes a set of parallel modeling methods and a model structure called the TDNN-Transformer, and introduces Linear Discriminant Analysis (LDA) to reduce the difficulty of feature training. With these techniques, the proposed model approaches the performance of traditional methods on a low-resource dataset.

2. Because the attention mechanism depends on the entire utterance, streaming real-time speech recognition is not directly supported. Addressing the online streaming decoding problem that arises in practical engineering applications, this research proposes a model based on multi-head monotonic chunkwise attention for fast streaming decoding. After using insertion pooling to further improve performance, the proposed model obtains better recognition performance than the traditional model on Tencent's 18,000-hour in-vehicle internal dataset, making commercial application possible. In addition, on the 100-hour Aishell-1 Chinese dataset, the proposed model still achieves recognition performance comparable to other existing models.

3. Multilingual speech recognition is another challenge today. End-to-end models can make better use of contextual information and improve accuracy on mixed-language speech. This research proposes an improved LAS (Listen, Attend and Spell) structure, combined with the Byte Pair Encoding (BPE) algorithm and a batch training method based on probability sampling, which effectively improves the performance of end-to-end models on Chinese-English multilingual speech recognition. In the Chinese-English multilingual challenge organized by the ASRU (Automatic Speech Recognition and Understanding) conference, the proposed model ranked 4th among 25 teams.
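To make the attention-based category concrete, the following is a minimal sketch (not the thesis code) of the core attention step in a LAS-style encoder-decoder: at each output step the decoder state is scored against every encoder frame, and the resulting soft alignment forms a context vector that conditions the next character prediction. Function names, dimensions, and the dot-product scoring are illustrative assumptions.

# Minimal dot-product attention sketch for an attention-based
# encoder-decoder; names and dimensions are illustrative.
import torch
import torch.nn.functional as F

def attention_context(decoder_state, encoder_outputs):
    """decoder_state: (batch, dim); encoder_outputs: (batch, frames, dim)."""
    # Score the query (decoder state) against every encoder frame.
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)  # soft alignment over all frames
    # Weighted sum of encoder frames gives the context vector.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights

# Toy usage: 2 utterances, 50 encoder frames, 256-dim states.
enc = torch.randn(2, 50, 256)
dec = torch.randn(2, 256)
ctx, w = attention_context(dec, enc)
print(ctx.shape, w.shape)  # (2, 256) and (2, 50)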
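The streaming result in point 2 relies on restricting attention to the past. Below is a hedged sketch of how inference with monotonic chunkwise attention typically behaves: frames are scanned left to right, a hard stopping frame is selected, and soft attention is applied only inside a small chunk ending at that frame, so decoding never waits for future audio. This is a generic MoChA-style illustration, not the thesis implementation; the energies, threshold, and chunk size are assumptions.

# Sketch of one monotonic chunkwise attention step at inference time.
import torch
import torch.nn.functional as F

def mocha_step(monotonic_energy, chunk_energy, prev_t, chunk_size=3):
    """monotonic_energy, chunk_energy: (frames,) scores for one utterance;
    prev_t: stopping frame chosen at the previous output step."""
    frames = monotonic_energy.size(0)
    t = prev_t
    while t < frames:
        # Hard, monotonic stopping decision for the current frame.
        if torch.sigmoid(monotonic_energy[t]) >= 0.5:
            break
        t += 1
    t = min(t, frames - 1)
    start = max(0, t - chunk_size + 1)
    # Soft attention only inside the chunk that ends at frame t.
    weights = F.softmax(chunk_energy[start:t + 1], dim=-1)
    return t, start, weights

# Toy usage: 20 encoder frames of made-up energies for one decoding step.
t, start, w = mocha_step(torch.randn(20), torch.randn(20), prev_t=0)
print(t, start, w.shape)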
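For point 3, the abstract mentions batch training based on probability sampling only at a high level. The following hypothetical sketch shows one common way such sampling can be set up for Chinese-English training: each batch is drawn from a single corpus with a fixed sampling probability, so the lower-resource language is not drowned out. The corpus names, probabilities, and helper function are assumptions for illustration only.

# Hypothetical batch construction by probability sampling over corpora.
import random

def sample_batches(corpora, probs, batch_size, num_batches):
    """corpora: dict name -> list of utterances; probs: dict name -> sampling probability."""
    names = list(corpora)
    weights = [probs[n] for n in names]
    for _ in range(num_batches):
        # Pick which corpus this batch comes from, then draw a batch from it.
        name = random.choices(names, weights=weights, k=1)[0]
        yield name, random.sample(corpora[name], min(batch_size, len(corpora[name])))

# Toy usage with made-up sampling probabilities.
corpora = {"zh": [f"zh_utt{i}" for i in range(100)], "en": [f"en_utt{i}" for i in range(40)]}
for lang, batch in sample_batches(corpora, {"zh": 0.6, "en": 0.4}, batch_size=4, num_batches=3):
    print(lang, batch)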
Keywords/Search Tags:end-to-end speech recognition, attention, low resources, online decoding, multilingual