
Audio-Video Based Cross-modal Speaker Retrieval And Recognition

Posted on: 2021-04-27 | Degree: Master | Type: Thesis
Country: China | Candidate: H Y Li | Full Text: PDF
GTID: 2428330611462396 | Subject: Computer Science and Technology
Abstract/Summary:
Cross-modal retrieval and matching based on audio and video is the task of finding correspondences between faces and voices. A large body of cognitive-science research has confirmed that humans can match a person's face to that same person's voice, which is instructive for building natural human-computer interaction systems and other multimedia applications. For cross-modal data carrying identity information, such as faces and voices, this thesis studies cross-modal face-voice retrieval and matching in the following aspects:

(1) A cross-modal speaker tagging method for face-voice matching based on an autoencoder structure is proposed. It introduces a joint-consistency principle and, combined with labeled training data, constructs a cross-modal audio-video retrieval and matching model. In the feature extraction stage, a convolutional neural network extracts face image features and a deep belief network extracts voice features. A softmax regression loss is attached to the output layer of the autoencoder model, a supervised training strategy is added, and the cross-modal information is finally expanded into three different model structures. Experimental results on large-scale datasets show that the model effectively improves the accuracy of the cross-modal face-voice annotation task.

(2) A cross-modal face-voice matching and retrieval model based on a co-attention mechanism is proposed. In the feature extraction stage, VGG-16 and SoundNet extract face and voice features, respectively. The model learns a common subspace embedding between face image features and voice features, introduces the co-attention mechanism to strengthen the similarity of the original features, and is trained with triplets of positive and negative samples so that intra-modal distances in the common subspace become smaller and cross-modal distances larger, thereby accomplishing the cross-modal face-voice matching and retrieval tasks.

(3) A dynamic cross-modal retrieval and matching model based on long short-term memory (LSTM) gates is proposed. For facial motion sequences and sound sequences taken from the same video, facial landmark features are extracted with VGGFace and Mel-spectrum features are extracted from the sound sequences. An LSTM-based encoder-decoder model minimizes the Huber-loss distance between hidden layers, together with an inter-frame constraint, to realize mutual retrieval and matching of dynamic face-speech sequences.

The three cross-modal retrieval models proposed in this thesis are fully evaluated on a sitcom dataset and a celebrity dataset, and improve significantly over existing methods on large-scale multi-category tasks and dynamic tasks.
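The supervised training strategy in (1) attaches a softmax regression head to the autoencoder so that reconstruction and identity classification are optimized jointly. A minimal NumPy sketch of such a combined objective is below; the weighting term `alpha` and the exact mean-squared reconstruction form are assumptions for illustration, not details taken from the thesis.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def supervised_ae_loss(x, x_hat, logits, labels, alpha=1.0):
    """Autoencoder reconstruction error plus a softmax regression
    (cross-entropy) term on identity labels.  `alpha` (hypothetical)
    trades off reconstruction against supervised tagging."""
    recon = np.mean((x - x_hat) ** 2)                 # reconstruction term
    probs = softmax(logits)
    # negative log-likelihood of the correct identity for each sample
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    return float(recon + alpha * ce)
```

In practice both encoders (face CNN, voice DBN) would feed this shared head during supervised training.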
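The co-attention mechanism in (2) can be sketched as a shared affinity matrix through which each modality attends over the other, reinforcing face features with their voice counterparts and vice versa. The plain dot-product affinity below is an assumption for illustration; the thesis may use a learned (e.g. bilinear) affinity instead.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(face, voice):
    """face: (n_f, d) face features; voice: (n_v, d) voice features.
    A shared affinity matrix lets each modality attend over the other,
    yielding cross-modally reinforced features of the same shapes."""
    affinity = face @ voice.T                        # (n_f, n_v) similarity scores
    att_face = softmax(affinity, axis=1) @ voice     # voice-conditioned face features
    att_voice = softmax(affinity, axis=0).T @ face   # face-conditioned voice features
    return att_face, att_voice
```

The attended features would then be embedded into the common subspace before computing retrieval distances.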
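The triplet training in (2) uses positive and negative samples to shape the common subspace. A standard hinge-form triplet loss, sketched below in NumPy, captures the idea; the margin value is an assumption, as the thesis does not state the exact formulation here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-based triplet loss over batches of common-subspace
    embeddings: matched (anchor, positive) pairs are pulled together
    and mismatched (anchor, negative) pairs pushed at least `margin`
    (value assumed) farther away."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # distance to matched sample
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # distance to mismatched sample
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```

Here the anchor would come from one modality (say, a face embedding) and the positive/negative from the other (voice embeddings of the same and of a different identity).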
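The Huber-loss distance minimized between the LSTM hidden layers in (3) is quadratic for small residuals and linear for large ones, so outlier frames do not dominate the sequence alignment. A minimal sketch, with the standard threshold `delta` assumed:

```python
import numpy as np

def huber_loss(h_face, h_voice, delta=1.0):
    """Huber distance between hidden states of the face-sequence and
    voice-sequence encoder-decoder branches.  Residuals below `delta`
    (value assumed) are penalized quadratically, larger ones linearly."""
    diff = np.abs(h_face - h_voice)
    quad = 0.5 * diff ** 2                 # small-residual branch
    lin = delta * (diff - 0.5 * delta)     # large-residual branch
    return float(np.where(diff <= delta, quad, lin).mean())
```

Minimizing this distance, alongside the inter-frame constraint, encourages the two LSTM branches to produce aligned hidden trajectories for the same speaker's video.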
Keywords/Search Tags: Cross-modal retrieval, Face-voice matching, Auto-Encoder, Co-attention mechanism, Long Short-Term Memory