
Research on End-to-End Speech Translation

Posted on: 2021-05-31    Degree: Master    Type: Thesis
Country: China    Candidate: X C Li    Full Text: PDF
GTID: 2428330611498207    Subject: Software engineering
Abstract/Summary:
Speech-to-text translation (S2T) refers to the process in which a machine automatically translates speech signals in the source language into text in the target language. The current mainstream practice is to cascade a speech recognition module and a machine translation module. Since speech recognition, machine translation, and speech translation are all essentially tasks of converting one sequence into another, and both speech recognition and machine translation can be modeled in an end-to-end manner, researchers have begun to study end-to-end speech translation. Because modeling the two transformation steps of speech recognition and machine translation with a single model increases the complexity of the mapping between input and output sequences, it may make the model harder to train or require more training data. In this context, this thesis explores the model structure and training methods of end-to-end speech translation.

First, a new end-to-end speech translation model is built from Convolutional Neural Networks and the Transformer. The translation invariance of convolution is used to alleviate the variability of speech signals. The features extracted from the speech signal of a single sentence usually span hundreds to thousands of frames, and several frames together form a word; the self-attention mechanism of the Transformer can capture both nearby and long-distance dependencies among these frames. Moreover, compared with a Recurrent Neural Network, the Transformer structure supports parallel computation, which is a significant advantage for the speech translation task. Experiments on a real-world corpus show that the end-to-end speech translation model proposed in this thesis performs significantly better than a model based on Recurrent Neural Networks.

In addition, this thesis proposes an adversarial training method to optimize the end-to-end speech translation model, which is essentially a Generative Adversarial Network consisting of a Generator and a Discriminator. The Discriminator learns to distinguish whether a target-language sentence comes from real text or from the output of the Generator. The Generator is the end-to-end speech translation model, which learns to produce target-language translations that are as realistic as possible in order to fool the Discriminator. During adversarial training, the Discriminator and the Generator challenge and learn from each other step by step, and a better end-to-end speech translation model is finally obtained. Moreover, training the Discriminator requires no labeled data, so a large amount of monolingual text in the target language can be used in its learning. Experiments show that this method significantly improves the end-to-end speech translation model.
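To make the first contribution concrete, the following is a minimal PyTorch sketch of a CNN-plus-Transformer speech translation model of the kind described above. It is not the author's implementation: all layer sizes, the two-layer strided convolution front-end, and the class and parameter names are illustrative assumptions.

    # Sketch only: a convolutional front-end that downsamples log-Mel speech
    # features, followed by a Transformer encoder-decoder emitting target tokens.
    import torch
    import torch.nn as nn

    class ConvTransformerS2T(nn.Module):
        def __init__(self, n_mels=80, d_model=256, vocab_size=10000,
                     n_heads=4, n_layers=6):
            super().__init__()
            # Two strided 2-D convolutions shrink the frame axis ~4x and give
            # some invariance to local shifts in the speech signal.
            self.conv = nn.Sequential(
                nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            )
            self.proj = nn.Linear(64 * ((n_mels + 3) // 4), d_model)
            self.tgt_embed = nn.Embedding(vocab_size, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=n_heads,
                num_encoder_layers=n_layers, num_decoder_layers=n_layers,
                batch_first=True)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, feats, tgt_tokens):
            # feats: (batch, frames, n_mels); tgt_tokens: (batch, tgt_len)
            x = self.conv(feats.unsqueeze(1))          # (B, C, frames/4, mels/4)
            b, c, t, f = x.shape
            src = self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
            tgt = self.tgt_embed(tgt_tokens)
            # Causal mask so each target position attends only to earlier ones.
            mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
            h = self.transformer(src, tgt, tgt_mask=mask)
            return self.out(h)                         # (B, tgt_len, vocab)

    model = ConvTransformerS2T()
    logits = model(torch.randn(2, 300, 80), torch.randint(0, 10000, (2, 20)))
    print(logits.shape)  # torch.Size([2, 20, 10000])

The strided convolutions roughly quarter the number of frames before self-attention is applied, which keeps the quadratic attention cost manageable for speech feature sequences that are much longer than the corresponding text.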
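The adversarial training scheme pairs the speech translation model (the Generator) with a Discriminator over target-language sentences. Below is a minimal sketch of one such training step, reusing the hypothetical ConvTransformerS2T class from the previous sketch. It is an assumption-laden illustration rather than the thesis's exact procedure; in particular, feeding the Discriminator soft token distributions (softmax outputs) is one common workaround for the non-differentiability of sampled text, and the thesis may handle this differently.

    # Sketch only: one discriminator step and one generator step of GAN-style
    # training for end-to-end speech translation. Assumes ConvTransformerS2T
    # from the previous sketch is already defined.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SentenceDiscriminator(nn.Module):
        def __init__(self, vocab_size=10000, d_model=256):
            super().__init__()
            self.embed = nn.Linear(vocab_size, d_model, bias=False)  # soft embedding
            self.encoder = nn.GRU(d_model, d_model, batch_first=True)
            self.score = nn.Linear(d_model, 1)

        def forward(self, token_probs):
            # token_probs: (batch, len, vocab) -- one-hot for real text,
            # softmax outputs for generated translations.
            _, h = self.encoder(self.embed(token_probs))
            return self.score(h[-1]).squeeze(-1)       # (batch,) real/fake logit

    gen, disc = ConvTransformerS2T(), SentenceDiscriminator()
    opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    feats = torch.randn(2, 300, 80)          # toy speech features
    real = torch.randint(0, 10000, (2, 20))  # toy target-language text

    # Discriminator step: real (monolingual) text vs. generator output.
    with torch.no_grad():
        fake_probs = F.softmax(gen(feats, real), dim=-1)   # teacher forcing
    real_probs = F.one_hot(real, 10000).float()
    d_loss = bce(disc(real_probs), torch.ones(2)) + \
             bce(disc(fake_probs), torch.zeros(2))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: translation cross-entropy plus a "fool the critic" term.
    logits = gen(feats, real)
    ce = F.cross_entropy(logits.reshape(-1, 10000), real.reshape(-1))
    g_loss = ce + 0.1 * bce(disc(F.softmax(logits, dim=-1)), torch.ones(2))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

Note that the discriminator step never uses source speech or parallel data, which is why, as the abstract states, large amounts of target-language monolingual text can be folded into its training.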
Keywords/Search Tags:Speech Translation, end-to-end, Transformer, encoder-decoder network, Adversarial training