
A Research On Generating Portrait From Speaker Voice Based On Deep Learning

Posted on: 2020-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: W Liu
GTID: 2518306518963189
Subject: Computer technology
Abstract/Summary:
This thesis aims to generate a face image from a short clip of a speaker's voice, such that the generated face resembles the speaker's real face. To address this cross-modal learning task, the thesis designs an end-to-end deep neural network (DNN) that learns an abstract mapping from speech to the human face in a self-supervised manner. The DNN model is divided into two parts. In the first part, a Speech Feature Extraction Network extracts a low-dimensional face feature from the spectrogram of the speaker's voice. In the second part, a Face Feature Decoding Network translates the face feature into an RGB face image. Because videos of people speaking naturally contain two modalities, the speaker's voice and the corresponding face image, they supply both the input to the voice-to-portrait model and the corresponding training label. The model therefore learns the abstract mapping from voice to face directly from video, without manual annotation. Building on this method, the thesis proposes a new mapping strategy that adds prior information to the mapping, prompting the DNN to learn the difference between the speaker's face feature and a prior face feature. The new mapping strategy reduces the learning difficulty of the network and improves the performance of the voice-to-portrait model. The experiments use AVSpeech [1], a large-scale video dataset of foreign speakers, as the training and test sets to evaluate the model. In addition, the experiments are extended to a special-group dataset: a small-scale set of Chinese videos is also used to test the approach. The experimental results are evaluated with a combination of qualitative and quantitative criteria. The results show that the proposed voice-to-portrait model can generate face images from a speaker's voice that resemble the speaker's real face.
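The prior-based mapping strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the prior face feature is obtained, so the mean of the training face features is used here purely as an assumption. The names `mean_feature`, `residual_target`, `reconstruct`, and `predict_face_feature` are hypothetical, and the speech network is stood in for by an arbitrary callable. This is not the thesis's implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_feature(features: Sequence[Vector]) -> Vector:
    """Assumed prior: the element-wise mean of a set of face features."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def residual_target(face_feature: Vector, prior: Vector) -> Vector:
    """Training target under the residual strategy: offset from the prior."""
    return [f - p for f, p in zip(face_feature, prior)]


def reconstruct(predicted_residual: Vector, prior: Vector) -> Vector:
    """Recover the full face feature by adding the prior back."""
    return [r + p for r, p in zip(predicted_residual, prior)]


def predict_face_feature(
    speech_net: Callable[[Vector], Vector],
    spectrogram: Vector,
    prior: Vector,
) -> Vector:
    """Run a (hypothetical) speech network that outputs a residual,
    then add the prior to obtain the feature passed to the face decoder."""
    return reconstruct(speech_net(spectrogram), prior)


# Toy data: two 2-dimensional "face features" taken from training videos.
training_faces = [[1.0, 2.0], [3.0, 6.0]]
prior = mean_feature(training_faces)
target = residual_target(training_faces[1], prior)
restored = reconstruct(target, prior)
```

In this toy setup the network only has to model how a speaker deviates from the prior rather than the whole face feature, which is one plausible reading of why the strategy reduces the learning difficulty.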
Keywords/Search Tags:Audio-Visual Cross-Modal Learning, Face Generation, Speech Analysis, Residual Learning