
Research On Deep Audio-Face Feature Fusion For Speaker Recognition And Annotation

Posted on: 2019-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: J J Geng
Full Text: PDF
GTID: 2428330566493537
Subject: Computer application technology
Abstract/Summary:
Speaker recognition has received considerable attention in recent years due to growing security demands in real applications. Methods based on a single biometric feature are generally insufficient to achieve good speaker identification performance, because such features vary significantly under uncontrollable environmental conditions. More and more works on multi-modal data fusion have emerged to address these challenging problems, since the features extracted from multi-modal data are richer than those extracted from a single modality. The development of more efficient multi-modal fusion methods therefore plays an important role in these applications.

In this thesis, we first present an efficient face-audio fusion method that uses multi-modal correlated neural networks for speaker recognition. In the proposed approach, the facial features learned by convolutional neural networks are compatible with the audio features at a high level, and the heterogeneous multi-modal features can be learned automatically. Accordingly, we propose a correlated neural network that fuses the face and audio modalities at different levels so that the speaker identity can be well identified. The experimental results show that the proposed multi-modal speaker recognition approach outperforms single-modality methods, and that feature-level fusion yields comparable and even better results than decision-level fusion.

Further, we present an efficient speaker naming approach based on deep discriminative audio-face fusion and co-attention learning. We start with VGG encodings of the face images and extract the Mel-Frequency Cepstral Coefficients (MFCCs) of the audio signals. Two audio feature encoding modules, Long Short-Term Memory (LSTM) encoding and 2D-convolution encoding, are then used alternatively to produce discriminative audio attention vectors. Meanwhile, we employ an end-to-end co-attention learning scheme based on convolution-softmax encoding of the concatenated audio-face features. Finally, we apply a factorized low-rank bilinear pooling approach to fuse the derived audio-face attention vectors efficiently and effectively. The experimental results show that the proposed speaker naming approach yields comparable and even better results than state-of-the-art counterparts.
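To make the fusion ideas above concrete, two minimal PyTorch sketches follow; they are illustrative under stated assumptions, not the thesis implementation. The first contrasts feature-level and decision-level fusion for speaker recognition, assuming precomputed face and audio embeddings with hypothetical dimensions (512 and 256) and an equal-weight score average for the decision-level case.

    # Sketch 1: feature-level vs. decision-level fusion (illustrative only).
    import torch
    import torch.nn as nn

    face = torch.randn(8, 512)    # hypothetical CNN face embeddings
    audio = torch.randn(8, 256)   # hypothetical audio embeddings
    n_speakers = 10

    # Feature-level fusion: join the modalities before classification.
    feature_fusion = nn.Sequential(nn.Linear(512 + 256, 256), nn.ReLU(),
                                   nn.Linear(256, n_speakers))
    logits_feat = feature_fusion(torch.cat([face, audio], dim=1))

    # Decision-level fusion: classify each modality, then average the scores.
    face_clf = nn.Linear(512, n_speakers)
    audio_clf = nn.Linear(256, n_speakers)
    logits_dec = 0.5 * face_clf(face) + 0.5 * audio_clf(audio)

The second sketches the speaker-naming fusion path: an LSTM with a softmax attention head turns an MFCC sequence into an audio attention vector, and a factorized low-rank bilinear pooling layer, z = P^T(U^T x * V^T y) with * the Hadamard product, fuses it with a face feature. The 4096-dimensional face feature (a typical VGG fully-connected layer size), the rank of 512, the dropout rate, and the classifier head are all assumptions made for illustration.

    # Sketch 2: LSTM audio attention + factorized low-rank bilinear fusion.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioLSTMEncoder(nn.Module):
        """Encode an MFCC sequence into a fixed-length audio attention vector."""
        def __init__(self, n_mfcc=40, hidden=256):
            super().__init__()
            self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
            self.attn = nn.Linear(hidden, 1)     # scores each time step

        def forward(self, mfcc):                 # mfcc: (B, T, n_mfcc)
            h, _ = self.lstm(mfcc)               # (B, T, hidden)
            w = F.softmax(self.attn(h), dim=1)   # attention weights over time
            return (w * h).sum(dim=1)            # (B, hidden)

    class LowRankBilinearFusion(nn.Module):
        """Factorized low-rank bilinear pooling: z = P^T(U^T x * V^T y)."""
        def __init__(self, dim_x, dim_y, rank=512, dim_out=256):
            super().__init__()
            self.U = nn.Linear(dim_x, rank, bias=False)
            self.V = nn.Linear(dim_y, rank, bias=False)
            self.P = nn.Linear(rank, dim_out)

        def forward(self, x, y):
            joint = self.U(x) * self.V(y)        # element-wise (Hadamard) product
            joint = F.dropout(joint, p=0.1, training=self.training)
            return F.normalize(self.P(joint), dim=-1)  # L2-normalized fusion

    class SpeakerNamer(nn.Module):
        def __init__(self, face_dim=4096, n_speakers=10):
            super().__init__()
            self.audio_enc = AudioLSTMEncoder()
            self.fusion = LowRankBilinearFusion(face_dim, 256)
            self.classifier = nn.Linear(256, n_speakers)

        def forward(self, face_feat, mfcc):
            a = self.audio_enc(mfcc)             # audio attention vector
            z = self.fusion(face_feat, a)        # fused audio-face representation
            return self.classifier(z)

    model = SpeakerNamer()
    logits = model(torch.randn(8, 4096), torch.randn(8, 100, 40))
    print(logits.shape)                          # torch.Size([8, 10])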
Keywords/Search Tags: Audiovisual, Feature Fusion, Speaker Naming, Low-Rank Bilinear Pooling