
Research On Multi-Modal Dubbing Synthesis

Posted on: 2024-06-23    Degree: Master    Type: Thesis
Country: China    Candidate: G E Chen    Full Text: PDF
GTID: 2568307160459344    Subject: Engineering
Abstract/Summary:
Speech synthesis is an important research direction in computer speech processing, with promising real-life applications. However, current speech synthesis systems are mostly based on single-modal inputs, namely Text-to-Speech (TTS) and Lip-to-Speech (LTS). In practical dubbing applications, each single-modal paradigm has its own problems. TTS involves no visual information, so it cannot adjust the speech output according to the video; the synthesized speech is unrelated to the source video, making audio-visual synchronization difficult to achieve. The main challenge of LTS is the complex mapping between speech and lip movements, which often leads to discrepancies between the synthesized speech and the actual spoken content. In addition, multi-modal speech synthesis faces problems such as the lack of dedicated visual encoders and the difficulty of aligning and fusing features.

To overcome the limitations of existing speech synthesis systems and meet the requirements of accurate dubbing content and audio-visual synchronization in dubbing scenarios, this thesis builds a non-autoregressive multi-modal dubbing synthesis system, MM2Speech. It synthesizes speech in parallel from both the given text and the visual information: the text ensures the accuracy of the output speech content, while the visual stream provides audio-visually correlated cues such as lip movements and emotions. Experimental results on the CMLR dataset demonstrate the effectiveness of introducing multi-modal information into dubbing synthesis, with the synthesized speech showing better audio-visual synchronization than TTS and higher content accuracy than LTS.

To address the redundant and under-utilized information in the visual encoder of MM2Speech, this thesis further designs a region-based visual feature extraction scheme. It extracts lip features and facial features that are more relevant to the task, and uses these two kinds of features to predict phoneme duration and to predict pitch and energy, respectively. Combining the region-based visual encoder with MM2Speech yields the multi-modal dubbing synthesis model MM2Speech-LF, which further improves the synchronization between the synthesized dubbing and the video. Experiments verify the effectiveness of introducing different visual information for different variance predictions: lip-movement information plays an important guiding role in predicting phoneme duration, while pitch and energy prediction relies more on static facial features.

The text features and visual features extracted in the above systems have unequal lengths, which makes them difficult to fuse and exploit fully. This thesis therefore proposes a temporal alignment and deep fusion scheme for unequal-length features. The scheme first determines the length of the Mel-spectrogram from the length of the video, and then up-samples the visual and text features to match the Mel-spectrogram sequence length. The two modalities are then deeply fused with bidirectional attention, and the fused features are used for pitch and energy prediction and for the decoding process. Combining the alignment and fusion module with MM2Speech-LF yields the MM2Speech-LF-DF model. Experiments show that the deep fusion method outperforms no fusion and simple fusion methods such as addition and concatenation, which validates the effectiveness of the temporal alignment and deep fusion scheme.
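To make the region-based variance prediction concrete, the following is a minimal sketch in PyTorch. It only illustrates the routing described above: lip-region features feed the phoneme-duration predictor, while face-region features feed the pitch and energy predictors. The module names, feature dimensions, and the generic Conv1d predictor are assumptions for illustration, not the exact layers of MM2Speech-LF.

```python
import torch
import torch.nn as nn


class VariancePredictor(nn.Module):
    """Generic 1-D convolutional predictor (duration / pitch / energy)."""

    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, time, in_dim) -> (batch, time)
        return self.net(x.transpose(1, 2)).squeeze(1)


class RegionBasedVarianceAdaptor(nn.Module):
    """Route lip features to duration, face features to pitch/energy."""

    def __init__(self, text_dim=256, lip_dim=128, face_dim=128):
        super().__init__()
        self.duration_predictor = VariancePredictor(text_dim + lip_dim)
        self.pitch_predictor = VariancePredictor(text_dim + face_dim)
        self.energy_predictor = VariancePredictor(text_dim + face_dim)

    def forward(self, text_feat, lip_feat, face_feat):
        # All inputs: (batch, time, dim), assumed already aligned in time.
        dur = self.duration_predictor(torch.cat([text_feat, lip_feat], dim=-1))
        pitch = self.pitch_predictor(torch.cat([text_feat, face_feat], dim=-1))
        energy = self.energy_predictor(torch.cat([text_feat, face_feat], dim=-1))
        return dur, pitch, energy
```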
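The temporal alignment and deep fusion scheme can likewise be sketched as follows, assuming hypothetical frame-rate and hop-size constants and using standard multi-head cross-attention in both directions; this is an illustrative approximation, not the exact MM2Speech-LF-DF configuration. The mel length is fixed by the video length, both modalities are up-sampled to that length, and the two attention views are merged before pitch/energy prediction and decoding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mel_length_from_video(num_video_frames: int, fps: float = 25.0,
                          sample_rate: int = 16000, hop: int = 200) -> int:
    """Derive the target Mel-spectrogram length from the video length."""
    duration_s = num_video_frames / fps
    return int(round(duration_s * sample_rate / hop))


def upsample_to(x: torch.Tensor, target_len: int) -> torch.Tensor:
    """Linearly interpolate a (batch, time, dim) sequence to target_len frames."""
    return F.interpolate(x.transpose(1, 2), size=target_len,
                         mode="linear", align_corners=False).transpose(1, 2)


class BidirectionalFusion(nn.Module):
    """Cross-attend text->visual and visual->text, then merge the two views."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, text_feat, visual_feat, mel_len: int):
        # Align both modalities to the mel-frame time axis first.
        text_feat = upsample_to(text_feat, mel_len)
        visual_feat = upsample_to(visual_feat, mel_len)
        # Text queries attend over visual keys/values, and vice versa.
        t_ctx, _ = self.t2v(text_feat, visual_feat, visual_feat)
        v_ctx, _ = self.v2t(visual_feat, text_feat, text_feat)
        fused = self.proj(torch.cat([t_ctx, v_ctx], dim=-1))
        # (batch, mel_len, dim), fed to pitch/energy prediction and the decoder.
        return fused
```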
Keywords/Search Tags: Dubbing Synthesis, Multi-modal, Non-autoregression, Audio-visual Synchronization, Deep Learning