
Multi-View Voice And Lip Motion Consistency Judgment Based On Lip Reconstruction And Multi-Key Segment Joint Analysis

Posted on: 2024-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: C Luo
Full Text: PDF
GTID: 2568307115989599
Subject: Master of Electronic Information (Professional Degree)
Abstract/Summary:
Voice and lip motion consistency judgment uses the causal relationship between a speaker's lip motion and voice to determine whether the audio and video were recorded simultaneously. Existing voice and lip motion consistency methods have mostly focused on frontal lip data and have not considered the effect of viewing-angle changes during video acquisition on the results. Moreover, the lack of audio content filtering in the analysis leaves the results vulnerable to weakly correlated segments such as silence and noise, resulting in poor robustness of lip motion and voice consistency judgment. To address these issues, this thesis studies the effect of multi-view lip changes on voice and lip motion consistency. The main research contents are as follows:

(1) We propose a multi-view decoupled-representation generative adversarial network (GAN) model based on lip rotation and detail enhancement. The pose weight module in the proposed network extracts the angle feature of a profile lip image, combines it with a preset one-hot encoding, and fuses it with the lip feature before feeding it into the generator to produce a frontal lip image. To preserve more lip detail in the generated images, we add symmetric, reconstruction, and pixel losses on top of the adversarial loss to improve image quality. Furthermore, in multi-feature fusion mode the network dynamically allocates weight coefficients to image features from different angles according to the lip deflection angle, and fuses the multi-image features into a new feature for frontal lip generation according to these coefficients. Experimental results show that our model generates high-quality frontal lip images: the mean structural similarity (SSIM) across four profile angles of the OuluVS2 database is 0.75, and frontal lip images generated by multi-feature fusion are of higher quality than those generated from a single lip
image.

(2) We propose a voice and lip motion consistency algorithm based on key sound detection, frontal lip reconstruction, and time-delay combination. First, we select key vowel sounds with significant lip changes as the key sound events and filter key sound segments out of the audio using a key vowel detection method that combines a Hilbert envelope with a zero-frequency filter. Then, we locate the video segments corresponding to the start positions of the key sound segments and perform frontal lip reconstruction on the lip images in those segments using the lip reconstruction model. Next, we analyze the audio-video correlation of each key sound segment with a bimodal deep correlation model and use covariance matrix re-estimation to reduce the impact of key sound filtering on the total sample size. Finally, we propose a scoring mechanism that combines the scores of multiple key sound segments based on time delay and correlation differences to determine consistency. Experimental results show that our voice and lip motion consistency judgment method outperforms several mainstream algorithms, and the overall EER under the four profile angles of the OuluVS2 database decreases by 0.7%, 5.5%, 9.9%, and 17.8%, respectively, compared with that before reconstruction.
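As a rough sketch, the generator objective in part (1), an adversarial loss augmented with symmetric, reconstruction, and pixel terms, might be combined as below. The loss weights and the choice of L1/L2 distances for each term are illustrative assumptions, not the thesis's actual formulation.

```python
import numpy as np

def generator_loss(d_fake, gen, target,
                   w_adv=1.0, w_sym=0.3, w_rec=1.0, w_pix=10.0):
    """Sketch of a combined GAN generator loss (weights are illustrative)."""
    # Adversarial term: push discriminator scores on generated images toward 1.
    l_adv = -np.mean(np.log(d_fake + 1e-8))
    # Symmetric loss: a frontal lip image should match its left-right mirror.
    l_sym = np.mean(np.abs(gen - gen[..., ::-1]))
    # Reconstruction loss: L1 distance to the ground-truth frontal image.
    l_rec = np.mean(np.abs(gen - target))
    # Pixel loss: L2 distance penalizing large per-pixel deviations.
    l_pix = np.mean((gen - target) ** 2)
    return w_adv * l_adv + w_sym * l_sym + w_rec * l_rec + w_pix * l_pix
```

A perfect, symmetric generation with a fully fooled discriminator drives every term to (near) zero, while any mismatch with the frontal target raises the loss.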
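The multi-feature fusion in part (1) weights features from different views by their deflection angle. The thesis learns these weight coefficients inside the network; the hypothetical sketch below instead uses a fixed Gaussian kernel over the deflection angle (the kernel form and `sigma` are assumptions) purely to illustrate angle-dependent weighted fusion.

```python
import numpy as np

def fuse_features(features, angles, sigma=30.0):
    """Angle-weighted fusion sketch: views closer to frontal (0 degrees)
    receive larger weights; weights are softmax-normalized to sum to 1."""
    angles = np.asarray(angles, dtype=float)
    scores = -(angles ** 2) / (2 * sigma ** 2)   # prefer small deflection
    w = np.exp(scores - scores.max())
    w /= w.sum()                                  # normalized weight coefficients
    feats = np.stack(features)                    # shape: (n_views, feat_dim)
    fused = (w[:, None] * feats).sum(axis=0)      # weighted combination
    return fused, w
```

For example, fusing features captured at 30 and 90 degrees gives the 30-degree view the dominant weight, so the fused feature stays close to the near-frontal one.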
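The key-sound filtering step in part (2) rests on an amplitude envelope. A minimal sketch of the envelope stage is shown below; it omits the zero-frequency-filter component of the thesis's detector, and the frame length and threshold ratio are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.signal import hilbert

def key_segments(audio, sr, frame=0.025, thresh_ratio=0.5):
    """Keep frames whose Hilbert envelope exceeds a fraction of the peak,
    and merge consecutive kept frames into (start, end) sample spans."""
    env = np.abs(hilbert(audio))                 # instantaneous amplitude envelope
    hop = int(sr * frame)
    frames = env[: len(env) // hop * hop].reshape(-1, hop)
    frame_env = frames.mean(axis=1)              # mean envelope per frame
    mask = frame_env > thresh_ratio * frame_env.max()
    segs, start = [], None
    for i, above in enumerate(mask):
        if above and start is None:
            start = i                            # segment opens
        elif not above and start is not None:
            segs.append((start * hop, i * hop))  # segment closes
            start = None
    if start is not None:
        segs.append((start * hop, len(mask) * hop))
    return segs
```

Running this on half a second of silence followed by half a second of tone yields a single segment covering roughly the voiced half, which is the behavior the key-sound filter relies on to discard silence and weakly correlated audio.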
Keywords/Search Tags: Consistency judgment, Frontal reconstruction, Generative adversarial network, Canonical correlation analysis, Bimodal