
Research On Multimodal Voice Conversion Under Adverse Environment Using Deep Convolutional Neural Network

Posted on: 2021-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Hu
Full Text: PDF
GTID: 2428330629980351
Subject: Computer Science and Technology

Abstract/Summary:
Voice conversion is a technology that converts the personalized voice of a source speaker into the personalized voice of a specific target speaker. It is widely used in areas such as speaker identity concealment, film dubbing, multimedia entertainment, and the medical field. At present, research on voice conversion is mainly conducted in clean, noiseless environments. However, the presence of noise can severely degrade the performance of voice conversion systems, and noise interference is difficult to avoid in both human-to-human communication and human-computer interaction in daily life. Therefore, improving the robustness of voice conversion models in noisy environments is an urgent problem in practical applications. Related research shows that adding visual information in a noisy environment can improve the effect of speech enhancement. Inspired by this, this thesis establishes a multimodal voice conversion model that combines visual and acoustic information to improve the robustness of the voice conversion system in noisy environments. The research content of this thesis mainly includes the following two parts:

(1) Building a Mandarin multimodal speech database

According to our survey, there is currently no complete multimodal speech database publicly available in China. To support research on Chinese multimodal voice conversion, this thesis records a Mandarin multimodal speech database. The database includes audio and video signals of both normal speech and whisper, so the corpus was selected according to the pronunciation characteristics of Mandarin and whispered speech. A total of 103 syllables and 100 short sentences commonly used in Chinese were selected as the corpus. The database was recorded by 10 standard Mandarin speakers (5 male and 5 female), yielding 4,060 videos and 4,060 speech segments. After a series of processing steps on the recorded material, the final multimodal speech database includes: speech, the original video sequence during pronunciation, facial picture sequences, lip picture sequences, the coordinates of 106 key points on the face, and syllable annotation data.

(2) Proposing a multimodal voice conversion method for noisy environments

Human speech perception is multi-channel in nature. There is a clear complementary relationship between visual and auditory information, and visual information can serve as a supplement to acoustic information. Therefore, the performance of a voice conversion model in a noisy environment can be improved by adding lip images as auxiliary information. This thesis proposes a multimodal voice conversion method based on a deep convolutional neural network (MDCNN) for noisy environments. The model uses two convolutional neural networks (CNNs) to extract lip sequence features and acoustic features, respectively. The visual and acoustic features extracted by the CNNs are then fused and fed to fully connected layers, which map the audio-visual features of the source speaker to the acoustic features of the target speaker. To evaluate the proposed MDCNN-based voice conversion model, 6 different noises were selected from the NOISEX-92 noise library and combined with 7 different signal-to-noise ratios to construct 42 distinct noise environments. The experimental results demonstrate the effectiveness of the proposed MDCNN method in noisy environments.
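The two-stream structure described above (a lip CNN, an acoustic CNN, and a fully connected fusion/mapping stage) can be sketched in PyTorch. The abstract does not specify layer counts, kernel sizes, input resolutions, or the acoustic feature dimension, so every size below (64x64 lip crops, 40-dimensional output features, channel widths) is an illustrative assumption, not the thesis configuration:

```python
import torch
import torch.nn as nn

class MDCNN(nn.Module):
    """Sketch of a two-stream audio-visual conversion network:
    one CNN per modality, features concatenated and mapped by FC layers.
    All layer sizes are illustrative assumptions."""

    def __init__(self, n_acoustic: int = 40):
        super().__init__()
        # Visual stream: a single-channel lip image per time step.
        self.lip_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),   # -> 32 * 4 * 4 = 512
        )
        # Acoustic stream: a spectrogram patch around the current frame.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),   # -> 512
        )
        # Fusion: concatenated features -> target-speaker acoustic features.
        self.fc = nn.Sequential(
            nn.Linear(512 + 512, 256), nn.ReLU(),
            nn.Linear(256, n_acoustic),
        )

    def forward(self, lip: torch.Tensor, spec: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.lip_cnn(lip), self.audio_cnn(spec)], dim=1)
        return self.fc(fused)

model = MDCNN()
lip = torch.randn(8, 1, 64, 64)    # batch of lip-region images
spec = torch.randn(8, 1, 40, 11)   # batch of spectrogram patches
out = model(lip, spec)             # shape: (8, 40)
```

In training, such a model would regress the fused source features onto time-aligned target-speaker acoustic features (e.g. with an L2 loss); the key point the sketch captures is that the visual stream keeps contributing even when the acoustic stream is corrupted by noise.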
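The 42 noise environments come from mixing clean utterances with NOISEX-92 noise at fixed signal-to-noise ratios. The abstract does not give the mixing procedure, but the standard approach is to scale the noise so the power ratio hits the target SNR; a minimal sketch (function name and signals are hypothetical):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to a clean signal at a target SNR in dB.

    The noise is tiled/cropped to the clean signal's length and scaled
    so that 10 * log10(P_clean / P_noise) equals snr_db.
    """
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale factor that brings the noise power to the target level.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
noise = rng.standard_normal(8000)
noisy = mix_at_snr(clean, noise, snr_db=0.0)
```

Sweeping `snr_db` over the 7 chosen levels for each of the 6 NOISEX-92 noise types would reproduce the 42-condition test grid described above.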
Keywords/Search Tags:Mandarin multimodal speech database, Multimodal voice conversion, Noisy environment, Convolutional neural networks