
Research On Training Data Expansion Method For End-to-end Speech Recognition Model Based On Text Data

Posted on: 2021-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: J X Guo
GTID: 2428330614950003
Subject: Computer Science and Technology
Abstract/Summary:
The era of intelligence is accelerating. Speech, as one of the most natural and convenient means of communication, plays an important role in bringing intelligent applications into daily life and work. Automatic Speech Recognition (ASR) is a technology that converts an input speech signal into text so that its content can then be understood. In recent years, with the development of general sequence-to-sequence modeling methods, end-to-end speech recognition models have been proposed. Compared with traditional methods, an end-to-end speech recognition model consists of a single sequence model that maps the acoustic feature sequence directly to the word sequence, which simplifies the speech recognition pipeline. At the same time, the end-to-end model does not rely on separate language models or pronunciation dictionaries, which reduces the need for expert knowledge. However, training an end-to-end speech recognition model to good performance usually requires a large number of speech-text pairs. In practical applications, collecting such paired data at scale is laborious and expensive, and the resulting shortage leads to recognition errors on rare words and proper nouns. Therefore, this thesis proposes a text-based training data expansion method for end-to-end speech recognition models. The main work and innovations are as follows:

(1) End-to-End Speech Recognition Model Based on RNN-T
The end-to-end speech recognition model based on RNN-T takes both acoustic and linguistic information into account during optimization. It is currently considered the best method in the field of end-to-end speech recognition. Therefore, this thesis uses the RNN-T model to build an end-to-end speech recognition system as the baseline and reports its experimental results.

(2) A Training Data Expansion Method Based on Generative Adversarial Network
To address the limitation that the RNN-T model cannot effectively recognize rare words and proper nouns, a method is proposed for training an end-to-end speech recognition model using only a large amount of text data that is not paired with speech signals. Inspired by the adversarial training mechanism, this thesis first uses a method based on the Generative Adversarial Network (GAN) to synthesize, for a large amount of text data, the corresponding pronunciation primitive sequences. The end-to-end speech recognition model is then retrained using this text data and its corresponding pronunciation primitive sequences as the expanded data.

(3) A Training Algorithm Combining Generative Adversarial Network and Connectionist Temporal Classification
In the data expansion method above, mode collapse is prone to occur because of the complex mapping between text sequences and pronunciation primitive sequences and the simple structure of the discriminator in the GAN. To this end, this thesis further proposes combining the loss functions of Connectionist Temporal Classification (CTC) and the GAN under a Multi-Task Learning (MTL) framework to jointly supervise the training of the GAN.

Experimental results on the Mandarin Chinese datasets AISHELL-1 and AISHELL-2 show that the system using the proposed expansion method achieves better recognition results than the baseline model.
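The joint supervision described in (3) can be illustrated with a minimal sketch. The abstract does not state how the CTC and GAN losses are combined, so the weighted sum below, the function name `multitask_generator_loss`, and the parameter `ctc_weight` are all assumptions made for illustration, not the thesis's actual formulation.

```python
def multitask_generator_loss(ctc_loss: float, gan_loss: float,
                             ctc_weight: float = 0.5) -> float:
    """Combine a CTC loss and a GAN loss into one training objective.

    Hypothetical form: a convex combination of the two task losses,
    a common choice in multi-task learning. The thesis abstract does
    not specify the weighting scheme, so this is an assumption.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    # ctc_weight = 1.0 trains on CTC alone; 0.0 trains on the GAN loss alone.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * gan_loss


# Example: equal weighting of the two (made-up) loss values.
combined = multitask_generator_loss(ctc_loss=2.0, gan_loss=4.0, ctc_weight=0.5)
```

Under this assumed form, the CTC term gives the generator a direct sequence-level training signal, while the adversarial term is kept as the second task.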
Keywords/Search Tags: Speech Recognition, End-to-End Model, Data Expansion, Generative Adversarial Network, Multi-Task Learning