
Research On Lip To Speech Synthesis Algorithm Based On Multimodal Feature Fusion

Posted on: 2023-01-19    Degree: Master    Type: Thesis
Country: China    Candidate: R Zeng    Full Text: PDF
GTID: 2558307118999499    Subject: Software engineering

Abstract/Summary:
Given a silent talking video as input, the goal of lip-to-speech synthesis (Lip2Speech) is to reconstruct the corresponding speech, which should carry both the spoken content and the speech-related characteristics of the speaker. Increasingly, researchers are turning their attention to Lip2Speech because of its large market demand and broad development prospects. In Lip2Speech, the network must learn to extract not only the spoken content of the video but also the speaker's speech-related information. Owing to the single input modality, existing Lip2Speech algorithms take little or no account of the speaker's speech-related information, so speaker characteristic information is lost in the reconstructed speech. In addition, most previous works are trained for a specific speaker, and the generalization ability of the model is insufficient. To address these problems, this thesis proposes a lip-to-speech synthesis algorithm based on multimodal feature fusion. The main research contents are as follows:

(1) To address the loss of speaker characteristic information, we propose adding a speech identity encoder to the main network to assist it in extracting speaker characteristic information. The network uses a stack of 3D CNNs (3-Dimensional Convolutional Neural Networks) to extract speech content information from the input silent video, and reconstructs the speaker's speech by fusing the speaker characteristic features with the video content features. Experimental results on the GRID dataset in the speaker-dependent setting show that the intelligibility of the reconstructed speech for a specific speaker reaches 0.736 and the identity similarity between the generated speech and the real speech is 0.898. We also conducted experiments under the multi-speaker condition, where all metrics improve, indicating that adding the pre-trained speech identity encoder to the network effectively extracts speaker characteristic information and improves the quality of the reconstructed speech.

(2) To address the insufficient generalization of multi-speaker models under a restricted vocabulary, a cross-modal adversarial memory module is proposed. It can synthesize the corresponding speech for different speakers, even from videos of speakers never seen during training. The module takes the source modality (e.g., video frames) as input, stores target-modality features (e.g., speech-related features) that are addressable by a given video feature, and eliminates the gap between the source and target modalities through a modality classifier trained adversarially. Specifically, the module saves the features of the source and target modalities and closes the gap between them, where the source modality is the input of the network and the target-modality features are what the network wants to retrieve from the memory module. An associative bridge is constructed from the interrelationship between the source and target memories, and this interrelationship is learned through the bridge, so that even when only the source modality is available as input, the framework can still retrieve the relevant target-modality features from the memory network and provide rich information for downstream tasks.
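The abstract gives no implementation details for this module, but a minimal sketch of how an addressable cross-modal memory with an associative bridge and an adversarial modality classifier might look is shown below. All names (CrossModalMemory, ModalityClassifier), the slot count, and the addressing scheme (softmax attention over paired source/target slots) are illustrative assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class CrossModalMemory(nn.Module):
    """Hypothetical sketch: paired source (video) and target (speech) memory slots.

    A video query addresses the source memory; the same addressing weights are
    reused over the target memory (the 'associative bridge'), so speech-related
    features can be recalled even when only a silent video is available."""

    def __init__(self, num_slots: int = 128, dim: int = 256):
        super().__init__()
        self.src_mem = nn.Parameter(torch.randn(num_slots, dim))  # video-side slots
        self.tgt_mem = nn.Parameter(torch.randn(num_slots, dim))  # speech-side slots

    def forward(self, video_feat: torch.Tensor) -> torch.Tensor:
        # video_feat: (batch, time, dim)
        attn = torch.softmax(video_feat @ self.src_mem.t(), dim=-1)  # addressing weights
        recalled_speech_feat = attn @ self.tgt_mem                   # bridge to target modality
        return recalled_speech_feat

class ModalityClassifier(nn.Module):
    """Hypothetical adversarial classifier: predicts whether a feature came from the
    video-addressed (recalled) path or from a real speech encoder; the memory is
    trained to fool it, closing the gap between the two modalities."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.net(feat)  # logits for BCEWithLogitsLoss
```

In such a setup, features from a real speech encoder and the recalled features would be fed to the classifier with opposite labels during training, while at inference only the video-addressed path is needed.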
This chapter validates the proposed framework on the GRID and Lip2Wav datasets and shows that the proposed method outperforms baseline methods in both multi-speaker and single-speaker settings, verifying the effectiveness of the cross-modal memory module.

In summary, during the testing and inference stages of the Lip2Speech model, even when only a silent video without corresponding speech is given as input, the proposed multimodal feature fusion networks can exploit multimodal information simultaneously and thereby reconstruct richer speech.
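To make the fusion pipeline of contribution (1) concrete, the sketch below shows how a stack of 3D convolutions over the silent video could be fused with an utterance-level embedding from a speech identity encoder to predict a mel-spectrogram; at inference the identity embedding would have to come from a reference clip or be recalled by a memory module, since no speech accompanies the input video. Every module name, layer choice, and dimension here is an illustrative assumption rather than the thesis's published architecture.

```python
import torch
import torch.nn as nn

class Lip2SpeechFusion(nn.Module):
    """Illustrative sketch: 3D-CNN content encoder + speaker identity embedding,
    fused and decoded into a mel-spectrogram (assumed 80 mel bins)."""

    def __init__(self, identity_dim: int = 256, hidden: int = 256, n_mels: int = 80):
        super().__init__()
        # 3D convolutions over (batch, channels, time, height, width) video frames
        self.content_encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(64, hidden, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool spatial dims
        )
        # fuse per-frame content features with the broadcast speaker embedding
        self.decoder = nn.GRU(hidden + identity_dim, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, video: torch.Tensor, identity_emb: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T, H, W); identity_emb: (B, identity_dim)
        content = self.content_encoder(video).squeeze(-1).squeeze(-1)  # (B, hidden, T)
        content = content.transpose(1, 2)                              # (B, T, hidden)
        ident = identity_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        fused, _ = self.decoder(torch.cat([content, ident], dim=-1))
        return self.to_mel(fused)                                      # (B, T, n_mels)

# Hypothetical inference from a silent clip: the identity embedding would come from a
# reference utterance of the speaker or be recalled by the cross-modal memory module.
model = Lip2SpeechFusion()
silent_video = torch.randn(1, 3, 75, 96, 96)   # e.g. 75 frames of 96x96 mouth crops
identity_emb = torch.randn(1, 256)             # placeholder speaker embedding
mel = model(silent_video, identity_emb)        # (1, 75, 80) predicted mel frames
```

The predicted mel-spectrogram would then be converted to a waveform by a separate vocoder, which is outside the scope of this sketch.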
Keywords/Search Tags:lip to speech synthesis, speaker characteristic information, multimodal feature fusion, cross-modal identifier, cross-modal adversarial memory module