Speech and lip motion (SLM) correlation analysis determines whether the lip movements during pronunciation are synchronous and consistent with the speech, based on the causal relationship between them; it has a wide range of applications, such as speaker diarization and speech-driven lip-sync in movies. Conventional methods for detecting audio playback attacks are mostly based on the audio modality alone. Although audio-visual (AV) multi-modal systems register both audio and visual biometric information to improve the security of identification systems, which partially prevents such attacks, they overlook important liveness information hidden inside the audio and visual data: the strong correlation between speech signal variations and the corresponding lip motions. This thesis focuses on the correlation of SLM and makes the following contributions.

(1) Based on the analysis of SLM consistency, a detection platform named SCUT-AV Playback Detection was developed. The system analyzes the correlation and time delay between speech and lip motion and detects record-playback attacks, among other functions. After studying the variety of playback attacks that an SLM detection system may face, four types of SLM inconsistency were defined. To construct inconsistent data of these types, the thesis used VidTIMIT, CUAVE, and the Chinese General Database as the major data sources. To overcome the incompleteness of VidTIMIT, a complementary bimodal database was established. In addition, the audio-visual data of different vowels were extracted from the Chinese General Database to form a vowel pronunciation database.

(2) Because conventional spatial lip-motion analysis models overlook the temporal relation between speech and lip motion, the thesis introduces the idea of joint analysis of space and time and proposes an SLM consistency analysis algorithm based on fused spatiotemporal correlation degrees. The temporal consistency score is defined as the correlation between lip shape (height
and width) and the speech amplitude, while co-inertia analysis (CoIA) provides the initial correlation degree between speech and spatial lip features. The thesis studied the difference in time delay between inconsistent and consistent data; since the typical CCA and QMI methods are easily affected by insufficient sample quantity and parameter choices, a CoIA method based on time-delay estimation is proposed. Experimental results show that the accuracy of this method is 95.4%, which is 9.7% and 4.9% higher than the previous methods, respectively. In addition, the method estimated time delays on a large database of consistent and inconsistent AV data and summarized a reasonable time-delay distribution for consistent data; combining this with the difference in correlation degree between the two kinds of data, a score-adjustment scheme based on the reasonable time delay and the consistency level is proposed. Finally, the correlation scores in the spatial and temporal domains are combined to evaluate consistency. On the four datasets containing SLM-inconsistent data, compared with a model that analyzes lip features only in the X-Y dimensions, experimental results show that the EER of SLM detection is reduced by approximately 8.2%, and is reduced by a further 5.4% when the consistency-level adjustment method is used.

(3) Methods based on statistics and correlation tend to overlook the time-varying information between visual frames and barely represent the structural properties of lip motion. Instead, the dynamic synchrony between the voice and the lip motion of a syllable or word is represented as a pattern by audio-visual coherence atoms, and an SLM consistency detection method based on shift-invariant (SI) audio-visual dictionary learning (AVDL) is introduced. Using temporal and spatiotemporal SI sparse-coding models to represent the speech and lip-motion signals that share the same time axis, an audio-visual dictionary was trained with AVDL. Since the sparse-coding step in AVDL leads to oversized
dimensions of the translation matrix, a new data projection method was proposed, and based on these patterns a new SLM consistency judgment criterion was also proposed. Compared with CoIA and MI among the statistical methods, and with the bimodal linear prediction model and the normalized correlation coefficient with SVM among the correlation methods, experimental results show that the EER of the proposed method decreased by 9.1%, 17.6%, 13.9%, and 10.5% on a small-vocabulary corpus, and by 3.2%, 12.4%, 7.2%, and 4.1% on a large-vocabulary corpus.

(4) The methods proposed in (2) and (3) analyze SLM consistency over the whole sentence and do not distinguish the segments where the lips change significantly or the audio-visual correlation is strong, which increases the amount of computation and makes the detection results vulnerable to silent or weakly associated fragments. Borrowing the lip-synching detection idea from singing, a new method based on matching AV vowel pronunciation events and analyzing the time delay at their locations is proposed. First, the training data are selected based on the vowel segmentation result; for the vowel segmentation problem, an AV vowel segmentation method with an accuracy of up to 93.5% is proposed. Second, the matching degree between the audio and lip-motion events is measured with the AV vowel dictionary, the time-delay distribution at the event positions is analyzed, and the two are fused by a GMM model to obtain the final consistency score. Experiments show that the computation cost of training and analysis is reduced by 35% compared with the method in (3), while the EER decreased by 2.1% and 4.6% compared with the methods in (2) and (3), respectively. To analyze the vowel units in depth, they are clustered by an agglomerative hierarchical clustering algorithm using dynamic mouth-sequence features. After AV correlation analysis, 19 vowels belonging to five categories with significantly higher AV correlation degrees are selected as specific pronunciation units, and based on these units a new consistency method
with lower computation cost is proposed. Experimental results show that the performance using the specific vowel units is close to that using the whole sentence; on the first three categories of SLM-inconsistent data, the EER is even reduced by 1.2%, 0.9%, and 0.5%. After further combining the analysis of the time-delay distribution at the vowel positions, the overall EER fell by 4.8%; among the inconsistent data, the third and fourth types are reduced by 4.9% and 10.6%, respectively.
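The temporal side of the analysis above — correlating the speech amplitude with a lip-shape trajectory and estimating their relative time delay — can be sketched as follows. This is a minimal illustration, assuming a per-frame audio amplitude envelope and a lip-height sequence have already been extracted and share the same frame rate; the function name, the plain normalized cross-correlation, and the synthetic signals are illustrative assumptions, not the thesis's exact CoIA-based pipeline.

```python
import numpy as np

def av_delay_and_correlation(audio_env, lip_height, max_lag=10):
    """Estimate how many frames the lip-height signal lags the speech
    amplitude envelope, via normalized cross-correlation over a small
    lag window; return (best_lag, correlation_at_best_lag)."""
    # Standardize both signals so the product average is a correlation.
    a = (audio_env - audio_env.mean()) / (audio_env.std() + 1e-12)
    v = (lip_height - lip_height.mean()) / (lip_height.std() + 1e-12)
    best_lag, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        # Positive lag: video lags audio, i.e. v[t] aligns with a[t - lag].
        if lag >= 0:
            c = np.mean(a[:len(a) - lag] * v[lag:])
        else:
            c = np.mean(a[-lag:] * v[:len(v) + lag])
        if c > best_corr:
            best_lag, best_corr = lag, c
    return best_lag, best_corr

# Synthetic consistent pair: the "lip" signal is the audio envelope
# delayed by 3 frames, so the estimator should recover lag = 3 with a
# correlation near 1; inconsistent pairs would score near 0 at all lags.
rng = np.random.default_rng(0)
audio = rng.standard_normal(200)
lip = np.roll(audio, 3)
lag, corr = av_delay_and_correlation(audio, lip)
```

In the thesis's scheme, the estimated delay would then be checked against the reasonable time-delay distribution learned from consistent data, while the correlation value feeds the temporal consistency score; CoIA plays the analogous role on the spatial lip features.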