A Research On Generating Portrait From Speaker Voice Based On Deep Learning

Posted on:2020-05-05

Degree:Master

Type:Thesis

Country:China

Candidate:W Liu

Full Text:PDF

GTID:2518306518963189

Subject:Computer technology

Abstract/Summary:

This thesis aims to generate a face image from a short segment of a speaker's voice such that the generated face resembles the speaker's real face. To address this cross-modal learning task, the thesis designs an end-to-end Deep Neural Network (DNN) that learns an abstract mapping from speech to the human face in a self-supervised manner. The DNN model is divided into two parts. In the first part, a Speech Feature Extraction Network extracts a low-dimensional face feature from the spectrogram of the speaker's voice. In the second part, a Face Feature Decoding Network translates the face feature into an RGB face image. Videos of people speaking naturally contain two modalities, the speaker's voice and the corresponding face image, which can serve as the input to the voice portrait model and as the corresponding training label, respectively. The model therefore learns the abstract mapping from voice to face from video in a self-supervised manner. Building on this method, the thesis proposes a new mapping strategy that adds prior information to the mapping, prompting the Deep Neural Network to learn the difference between the speaker's face feature and a prior face feature. The new mapping strategy reduces the learning difficulty of the network and improves the performance of the voice portrait model. The experiments use AVSpeech^[1], a large-scale video dataset of foreign speakers, as the training and test sets to evaluate the model. In addition, the experiments are generalized to a special-group dataset, a small-scale Chinese video dataset. The results are evaluated with a combination of qualitative and quantitative criteria. They show that the proposed voice portrait model can generate face images from a speaker's voice that resemble the speaker's real face.
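The two-part pipeline and the prior-based mapping strategy described in the abstract can be illustrated with a toy sketch. This is not the thesis's implementation: the random linear maps stand in for trained networks, and the function names, layer sizes, and the choice of an all-zero prior feature are all assumptions made purely for illustration.

```python
import random

def make_linear(n_in, n_out, rng):
    """Random weight matrix (list of rows); a stand-in for a trained network."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply_linear(weights, x):
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def voice_to_face_feature(spectrogram, encoder, prior_feature):
    """Residual mapping: the encoder predicts only the offset from the prior."""
    flat = [v for frame in spectrogram for v in frame]
    residual = apply_linear(encoder, flat)
    return [p + r for p, r in zip(prior_feature, residual)]

def decode_face(feature, decoder, height, width):
    """Map a face feature to a height x width x 3 image with values in [0, 1]."""
    raw = apply_linear(decoder, feature)
    pixels = [min(1.0, max(0.0, 0.5 + v)) for v in raw]
    return [
        [pixels[(r * width + c) * 3:(r * width + c) * 3 + 3] for c in range(width)]
        for r in range(height)
    ]

# Hypothetical toy sizes; real spectrograms and images are far larger.
FRAMES, BINS, FEAT, H, W = 4, 8, 16, 2, 2
rng = random.Random(0)
encoder = make_linear(FRAMES * BINS, FEAT, rng)   # speech feature extraction (stand-in)
decoder = make_linear(FEAT, H * W * 3, rng)       # face feature decoding (stand-in)
prior = [0.0] * FEAT                              # assumed prior face feature

spectrogram = [[rng.random() for _ in range(BINS)] for _ in range(FRAMES)]
feature = voice_to_face_feature(spectrogram, encoder, prior)
image = decode_face(feature, decoder, H, W)
```

In this sketch the encoder only has to output the offset from the prior feature, which is one way to read the abstract's claim that the strategy reduces the learning difficulty of the network.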

Keywords/Search Tags:

Audio-Visual Cross-Modal Learning, Face Generation, Speech Analysis, Residual Learning

Related items

1	Multimodal Cognitive Learning For Audio-visual Data
2	Study On Cross-modal Speech Recognition Methods With Fusion Lipreading
3	Research On Multi-modal Speech Separation Based On Audio-visual Combination
4	Cross-modal Metric Learning For Heterogeneous Face Recognition
5	Cross-Modal Generation And Synchronization Identification For Audio-Visual Data
6	Audio-Visual Speech Recognition And Its Applications
7	The Method Of Face Portrait Based On Speech
8	Research On Cross-modal Retrieval Of Speech And Image Based On Deep Neural Network
9	Cross-Modal Face Recognition Based On Deep Learning
10	Cross-modal Retrieval Research Based On Correlation Analysis And Structure Preserving