
Audio-Video Based Cross-modal Speaker Retrieval And Recognition

Posted on: 2021-04-27 | Degree: Master | Type: Thesis
Country: China | Candidate: H Y Li | Full Text: PDF
GTID: 2428330611462396 | Subject: Computer Science and Technology
Abstract/Summary:
Cross-modal retrieval and matching based on audio and video is the task of finding correspondences between faces and voices. A large body of cognitive-science research has confirmed that humans can match a person's face to that same person's voice, which is instructive for building natural human-computer interaction systems and other multimedia applications. For cross-modal data carrying identity information, such as faces and voices, this thesis studies cross-modal face-voice retrieval and matching in the following aspects:

(1) A cross-modal speaker tagging method for face-voice matching based on an autoencoder structure is proposed. It introduces a joint-consistency principle and, combined with labeled training data, constructs a cross-modal audio-video retrieval and matching model. In the feature extraction stage, a convolutional neural network extracts face image features and a deep belief network extracts voice features. A softmax regression loss is attached to the output layer of the autoencoder model, a supervised training strategy is added, and the cross-modal information is finally expanded into three different model structures. Experimental results on large-scale datasets show that the model effectively improves the accuracy of the cross-modal face-voice annotation task.

(2) A cross-modal face-voice matching and retrieval model based on a co-attention mechanism is proposed. In the feature extraction stage, VGG-16 and SoundNet extract face and voice features, respectively. The model learns a common subspace embedding between face image features and voice features, introduces the co-attention mechanism to strengthen the similarity of the original features, and is trained with triplets of positive and negative samples so that intra-modal distances in the common subspace become smaller and cross-modal distances larger, thereby accomplishing the cross-modal face-voice matching and retrieval tasks.

(3) A dynamic cross-modal retrieval and matching model based on long short-term memory (LSTM) gates is proposed. For facial motion sequences and sound sequences taken from the same video, facial landmark features are extracted with VGGFace and Mel-spectrum features are extracted from the sound sequences. An LSTM-based encoder-decoder model minimizes the Huber-loss distance between hidden layers, together with an inter-frame constraint, to realize mutual retrieval and matching of dynamic face-speech sequences.

The three cross-modal retrieval models proposed in this thesis are fully evaluated on a sitcom dataset and a celebrity dataset, and improve significantly over existing methods on large-scale multi-category tasks and dynamic tasks.
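The supervised training strategy in (1) attaches a softmax regression head to the autoencoder so that reconstruction and identity classification are optimized jointly. A minimal NumPy sketch of such a combined objective is below; the weighting term `alpha` and the exact mean-squared reconstruction form are assumptions for illustration, not details taken from the thesis.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def supervised_ae_loss(x, x_hat, logits, labels, alpha=1.0):
    """Autoencoder reconstruction error plus a softmax regression
    (cross-entropy) term on identity labels.  `alpha` (hypothetical)
    trades off reconstruction against supervised tagging."""
    recon = np.mean((x - x_hat) ** 2)                 # reconstruction term
    probs = softmax(logits)
    # negative log-likelihood of the correct identity for each sample
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    return float(recon + alpha * ce)
```

In practice both encoders (face CNN, voice DBN) would feed this shared head during supervised training.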
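The co-attention mechanism in (2) can be sketched as a shared affinity matrix through which each modality attends over the other, reinforcing face features with their voice counterparts and vice versa. The plain dot-product affinity below is an assumption for illustration; the thesis may use a learned (e.g. bilinear) affinity instead.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(face, voice):
    """face: (n_f, d) face features; voice: (n_v, d) voice features.
    A shared affinity matrix lets each modality attend over the other,
    yielding cross-modally reinforced features of the same shapes."""
    affinity = face @ voice.T                        # (n_f, n_v) similarity scores
    att_face = softmax(affinity, axis=1) @ voice     # voice-conditioned face features
    att_voice = softmax(affinity, axis=0).T @ face   # face-conditioned voice features
    return att_face, att_voice
```

The attended features would then be embedded into the common subspace before computing retrieval distances.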
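The triplet training in (2) uses positive and negative samples to shape the common subspace. A standard hinge-form triplet loss, sketched below in NumPy, captures the idea; the margin value is an assumption, as the thesis does not state the exact formulation here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-based triplet loss over batches of common-subspace
    embeddings: matched (anchor, positive) pairs are pulled together
    and mismatched (anchor, negative) pairs pushed at least `margin`
    (value assumed) farther away."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # distance to matched sample
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # distance to mismatched sample
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```

Here the anchor would come from one modality (say, a face embedding) and the positive/negative from the other (voice embeddings of the same and of a different identity).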
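The Huber-loss distance minimized between the LSTM hidden layers in (3) is quadratic for small residuals and linear for large ones, so outlier frames do not dominate the sequence alignment. A minimal sketch, with the standard threshold `delta` assumed:

```python
import numpy as np

def huber_loss(h_face, h_voice, delta=1.0):
    """Huber distance between hidden states of the face-sequence and
    voice-sequence encoder-decoder branches.  Residuals below `delta`
    (value assumed) are penalized quadratically, larger ones linearly."""
    diff = np.abs(h_face - h_voice)
    quad = 0.5 * diff ** 2                 # small-residual branch
    lin = delta * (diff - 0.5 * delta)     # large-residual branch
    return float(np.where(diff <= delta, quad, lin).mean())
```

Minimizing this distance, alongside the inter-frame constraint, encourages the two LSTM branches to produce aligned hidden trajectories for the same speaker's video.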
Keywords/Search Tags: Cross-modal retrieval, Face-voice matching, Auto-Encoder, Co-attention mechanism, Long Short-Term Memory