As a branch of translation technology, simultaneous speech translation has broad application value, for example in automatic subtitle generation for foreign-language videos and simultaneous interpreting at international conferences. However, compared with mature neural machine translation technology, simultaneous speech translation still faces great challenges. The traditional cascade model, composed of a speech recognition model followed by a machine translation model, has inherent disadvantages such as processing delay and error propagation. Although an end-to-end simultaneous speech translation model can avoid these problems, it must jointly handle information from both the speech and text modalities, which is not easy. In addition, previous studies have shown that in the speech recognition task, Transducer-based end-to-end models achieve not only a very low word error rate but also very low streaming latency. Therefore, this paper focuses on applying the Transformer Transducer model to the speech translation task. The main work and innovations are as follows:

First, to account for the different word-order alignment between the audio sequence and the translation sequence in speech translation, this paper proposes a new Transformer Transducer model together with two different mask structures. A Conv-Transformer neural network is used to extract audio features in the transcription network module; a unidirectional self-attention Transformer is used to encode the translation sequence in the prediction network module; and a cross-attention Transformer is used to fuse the audio and text features in the integration network module. For the inference stage, this paper also designs two corresponding streaming decoding methods, one for a low-latency setting and one for a high-accuracy setting.

Second, we conduct extensive experiments on the Transformer Transducer end-to-end simultaneous speech translation model with 
different optimization methods. (1) We study the influence of pre-training: the Transformer Transducer model parameters are initialized from a pre-trained speech recognition model and a pre-trained language model respectively, and the results are analyzed through experiments. (2) We study the influence of additional auxiliary loss functions: we experiment with an offline speech translation loss, a sequence-level Transducer loss regularization, and a translation delay loss on the Transformer Transducer model. (3) We study the influence of knowledge distillation: we apply the sequence-level knowledge distillation method and analyze its optimization effect on the Transformer Transducer model. In addition, this paper compares the Transformer Transducer model with other state-of-the-art end-to-end simultaneous speech translation models. Our model achieves very good results on the MuST-C public dataset; in particular, in the low-latency regime it gains significant improvements of 8-10 BLEU points.
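The streaming behaviour of a Transducer-style decoder, which alternates between consuming audio frames and emitting target tokens, can be sketched in a minimal, assumption-level form. This is an illustration of the general greedy Transducer decoding scheme, not the thesis implementation; the `BLANK` symbol, the toy joint function, and the per-frame token cap are all hypothetical.

```python
# Minimal sketch of greedy Transducer-style streaming decoding.
# At each encoder frame, a joint network either predicts a blank symbol
# (advance to the next audio frame) or a target token (extend the
# hypothesis while staying on the current frame).

BLANK = 0  # illustrative blank-symbol id

def greedy_transducer_decode(frames, joint, max_tokens_per_frame=3):
    """frames: sequence of encoder outputs; joint(frame, hyp) -> token id."""
    hyp = []                       # target tokens emitted so far
    for frame in frames:
        emitted = 0
        while emitted < max_tokens_per_frame:
            token = joint(frame, hyp)
            if token == BLANK:     # blank: consume the next audio frame
                break
            hyp.append(token)      # non-blank: emit and stay on this frame
            emitted += 1
    return hyp

# Toy joint network: emit the frame's "label" once, then blank.
def toy_joint(frame, hyp):
    return frame if (not hyp or hyp[-1] != frame) else BLANK

print(greedy_transducer_decode([5, 5, 7], toy_joint))  # → [5, 7]
```

Lowering `max_tokens_per_frame` (or biasing the joint toward blank) trades translation quality for latency, which is the same lever the low-latency and high-accuracy decoding settings above adjust.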
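The auxiliary objectives studied in (2) are typically combined with the main Transducer loss as a weighted multi-task objective. The following sketch assumes simple linear weighting; the weight values `w_st` and `w_delay` are hypothetical and not taken from the thesis.

```python
def combined_loss(transducer_loss, offline_st_loss, delay_loss,
                  w_st=0.3, w_delay=0.1):
    """Illustrative multi-task objective: the Transducer loss is the
    main term, while the offline speech-translation loss and the
    translation delay loss act as auxiliary regularizers.
    The weights are placeholders, not values from the thesis."""
    return transducer_loss + w_st * offline_st_loss + w_delay * delay_loss

total = combined_loss(2.0, 1.0, 4.0)  # ≈ 2.0 + 0.3*1.0 + 0.1*4.0
```

Setting `w_delay` higher pushes the model toward earlier emissions at the cost of accuracy, mirroring the delay/quality trade-off the experiments explore.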