
Application Research of Attention-Based End-to-End Speech Recognition

Posted on: 2021-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: B J Liu
Full Text: PDF
GTID: 2428330611466448
Subject: Signal and Information Processing
Abstract/Summary:
With the improvement of computing performance and the development of big data, the application of deep learning has greatly reduced the error rate of speech recognition systems, making systems based on the Hidden Markov Model-Deep Neural Network (HMM-DNN) framework the mainstream. In recent years, end-to-end speech recognition methods have attracted extensive attention. Unlike the traditional approach, which requires a series of training steps to obtain frame-level alignment labels, end-to-end methods directly learn the mapping between speech features and text, simplifying the training process of the speech recognition model. End-to-end methods can be roughly divided into two categories: the Connectionist Temporal Classification (CTC) method, which directly trains frame-level alignments, and attention-based methods, which model the correspondence between the feature sequence and the text sequence. This work focuses on end-to-end speech recognition models based on the attention mechanism. The main research results are as follows:

1. End-to-end speech recognition models tend to overfit when training data is limited because they typically have a large number of parameters; as a result, they perform worse than traditional hybrid models. Using the 8-hour small-scale English dataset TIMIT, this research proposes a set of parallel modeling methods and a model structure called the TDNN-Transformer, and introduces Linear Discriminant Analysis (LDA) to reduce the difficulty of feature training. With these techniques, the proposed model approaches the performance of traditional methods on a low-resource dataset.

2. Because the attention mechanism depends on the entire utterance, streaming real-time speech recognition is not directly supported. Addressing the online streaming decoding problem that arises in practical engineering applications, this research proposes a model based on multi-head monotonic chunkwise attention for fast streaming decoding. After using insertion pooling to further improve performance, the proposed model obtains better recognition performance than the traditional model on Tencent's 18,000-hour in-vehicle internal dataset, making commercial application possible. In addition, on the 100-hour Aishell-1 Chinese dataset, the proposed model still achieves recognition performance comparable to other existing models.

3. Multilingual speech recognition is another challenge today. End-to-end models can make better use of contextual information and improve accuracy on mixed-language speech. This research proposes an improved LAS (Listen, Attend and Spell) structure, combined with the Byte Pair Encoding (BPE) algorithm and a batch training method based on probability sampling, which effectively improves the performance of end-to-end models on Chinese-English multilingual speech recognition. In the Chinese-English multilingual challenge organized by the ASRU (Automatic Speech Recognition and Understanding) conference, the proposed model ranked 4th among 25 teams.
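To make the attention-based category concrete, the following is a minimal sketch (not the thesis code) of the core attention step in a LAS-style encoder-decoder: at each output step the decoder state is scored against every encoder frame, and the resulting soft alignment forms a context vector that conditions the next character prediction. Function names, dimensions, and the dot-product scoring are illustrative assumptions.

# Minimal dot-product attention sketch for an attention-based
# encoder-decoder; names and dimensions are illustrative.
import torch
import torch.nn.functional as F

def attention_context(decoder_state, encoder_outputs):
    """decoder_state: (batch, dim); encoder_outputs: (batch, frames, dim)."""
    # Score the query (decoder state) against every encoder frame.
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)  # soft alignment over all frames
    # Weighted sum of encoder frames gives the context vector.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights

# Toy usage: 2 utterances, 50 encoder frames, 256-dim states.
enc = torch.randn(2, 50, 256)
dec = torch.randn(2, 256)
ctx, w = attention_context(dec, enc)
print(ctx.shape, w.shape)  # (2, 256) and (2, 50)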
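The streaming result in point 2 relies on restricting attention to the past. Below is a hedged sketch of how inference with monotonic chunkwise attention typically behaves: frames are scanned left to right, a hard stopping frame is selected, and soft attention is applied only inside a small chunk ending at that frame, so decoding never waits for future audio. This is a generic MoChA-style illustration, not the thesis implementation; the energies, threshold, and chunk size are assumptions.

# Sketch of one monotonic chunkwise attention step at inference time.
import torch
import torch.nn.functional as F

def mocha_step(monotonic_energy, chunk_energy, prev_t, chunk_size=3):
    """monotonic_energy, chunk_energy: (frames,) scores for one utterance;
    prev_t: stopping frame chosen at the previous output step."""
    frames = monotonic_energy.size(0)
    t = prev_t
    while t < frames:
        # Hard, monotonic stopping decision for the current frame.
        if torch.sigmoid(monotonic_energy[t]) >= 0.5:
            break
        t += 1
    t = min(t, frames - 1)
    start = max(0, t - chunk_size + 1)
    # Soft attention only inside the chunk that ends at frame t.
    weights = F.softmax(chunk_energy[start:t + 1], dim=-1)
    return t, start, weights

# Toy usage: 20 encoder frames of made-up energies for one decoding step.
t, start, w = mocha_step(torch.randn(20), torch.randn(20), prev_t=0)
print(t, start, w.shape)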
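For point 3, the abstract mentions batch training based on probability sampling only at a high level. The following hypothetical sketch shows one common way such sampling can be set up for Chinese-English training: each batch is drawn from a single corpus with a fixed sampling probability, so the lower-resource language is not drowned out. The corpus names, probabilities, and helper function are assumptions for illustration only.

# Hypothetical batch construction by probability sampling over corpora.
import random

def sample_batches(corpora, probs, batch_size, num_batches):
    """corpora: dict name -> list of utterances; probs: dict name -> sampling probability."""
    names = list(corpora)
    weights = [probs[n] for n in names]
    for _ in range(num_batches):
        # Pick which corpus this batch comes from, then draw a batch from it.
        name = random.choices(names, weights=weights, k=1)[0]
        yield name, random.sample(corpora[name], min(batch_size, len(corpora[name])))

# Toy usage with made-up sampling probabilities.
corpora = {"zh": [f"zh_utt{i}" for i in range(100)], "en": [f"en_utt{i}" for i in range(40)]}
for lang, batch in sample_batches(corpora, {"zh": 0.6, "en": 0.4}, batch_size=4, num_batches=3):
    print(lang, batch)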
Keywords/Search Tags:end-to-end speech recognition, attention, low resources, online decoding, multilingual