
A Study On Deep Learning-Based Voice Conversion For Identity Disguise In Voice Communication

Posted on: 2023-12-29  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y Y Ding  Full Text: PDF
GTID: 1528306902959089  Subject: Information and Communication Engineering
Abstract/Summary:
As an important technique in speech generation, voice conversion converts a source speaker's voice into a target speaker's voice while preserving the linguistic content, so that the converted speech is perceived as spoken by the target speaker. Voice conversion techniques show important application value in personalized speech synthesis, audiobook production, entertainment toys, identity disguise in voice communication, and so on. In recent years, with the development of machine learning, deep learning-based statistical parametric modeling has gradually become the mainstream approach to voice conversion. In these methods, a voice conversion system usually consists of a feature extractor, an acoustic feature predictor, and a vocoder. The feature extractor extracts appropriate acoustic features from the source speaker's waveforms; the extracted features are then fed into the acoustic feature predictor to generate the target speaker's acoustic features; finally, the converted waveforms of the target speaker are reconstructed by the vocoder. The naturalness and speaker similarity of the converted speech are usually used as evaluation metrics.

Compared with applications such as audiobooks or toys, identity disguise in voice communication places higher requirements on voice conversion techniques. First, in practical applications it is difficult to obtain a large amount of parallel data from the source and target speakers, which restricts the performance of conventional voice conversion methods that rely on parallel data for model training. Second, to satisfy the requirements of voice communication, the converted speech must be generated under a low-latency constraint, so conventional methods that take a whole sentence as input cannot be used. Finally, with the development of speech spoofing detection, increasing the adversarial ability of converted speech against spoofing detectors can further improve the performance of identity disguise in voice communication. Therefore, this thesis focuses on deep learning-based voice conversion for identity disguise in voice communication. The main research contents of this thesis are as follows.

First, the thesis studies generatively trained deep neural network (DNN)-based voice conversion using deep autoencoders (DAEs) with binary distributed hidden units. When acoustic features are decoded from hidden features by neural feature extractors, either the reconstruction error is high or an over-smoothing effect appears, which degrades the quality of the converted speech. Therefore, this thesis proposes to integrate a new feature extractor, a DAE with binary distributed hidden units, into the generatively trained DNN, achieving a low reconstruction error while alleviating the over-smoothing effect and thus improving the quality of the converted speech.

Second, the thesis studies sequence-to-sequence voice conversion with limited parallel data. Sequence-to-sequence voice conversion models are trained on parallel data, which is difficult to obtain in practical applications, and the quality of the converted speech degrades significantly when only limited parallel data is available. Therefore, this thesis proposes a training method based on pseudo-parallel data generation for the case where the source and target speakers have a small amount of parallel data and a large amount of non-parallel data, improving both the naturalness and the similarity of the converted speech under limited parallel data.

Third, the thesis studies low-latency recognition-synthesis-based any-to-one voice conversion. In existing low-latency voice conversion methods, DNNs predict the acoustic features of the target speaker and the converted speech waveforms are reconstructed by traditional vocoders, so the quality of the converted speech is not high; moreover, parallel data is required to train the conversion model. Therefore, this thesis proposes a low-latency recognition-synthesis-based any-to-one voice conversion system trained with non-parallel data. A frame-by-frame automatic speech recognition (ASR) model with a minimum mutual information loss is designed to extract bottleneck features containing less speaker information. The naturalness of the converted speech is comparable to that of an upper-bound model without latency constraints.

Finally, the thesis studies adversarial post-processing of voice conversion against spoofing detection. State-of-the-art neural speech spoofing detectors can effectively distinguish synthetic and converted waveforms from natural ones. Therefore, inspired by adversarial example generation, this thesis proposes a method that post-processes converted speech against speech spoofing detectors, improving the ability of voice conversion to evade speech spoofing detection.
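The three-stage decomposition described above (feature extractor, acoustic feature predictor, vocoder) can be sketched as follows. This is a toy illustration only: the function names, the framing scheme, and the per-frame transform are hypothetical placeholders, not the actual models studied in the thesis.

```python
# Minimal sketch of a three-stage voice conversion pipeline:
# feature extractor -> acoustic feature predictor -> vocoder.
# All components here are illustrative stand-ins.

def extract_features(source_waveform):
    """Stand-in feature extractor: slice the waveform into fixed frames."""
    frame = 4  # toy frame length in samples
    return [source_waveform[i:i + frame]
            for i in range(0, len(source_waveform) - frame + 1, frame)]

def predict_target_features(source_frames):
    """Stand-in acoustic predictor: map each source-speaker frame to a
    'target-speaker' frame (here, a trivial per-sample transform)."""
    return [[2.0 * x + 0.1 for x in f] for f in source_frames]

def vocode(target_frames):
    """Stand-in vocoder: reassemble predicted frames into a waveform."""
    return [x for f in target_frames for x in f]

source = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
converted = vocode(predict_target_features(extract_features(source)))
```

In a real system the extractor would produce spectral features such as mel-spectrograms, the predictor would be a neural network, and the vocoder would be a signal-processing or neural waveform generator; the sketch only shows how data flows between the three stages.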
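The pseudo-parallel data generation idea in the second contribution can be illustrated schematically: a seed model fit on the small parallel set converts the plentiful non-parallel source data, and the resulting (source, converted) pairs augment the training set. The scalar least-squares "model" below is a toy stand-in for the sequence-to-sequence network; all data values are invented for illustration.

```python
# Hedged sketch of training-set augmentation via pseudo-parallel data.
# A seed model trained on scarce parallel pairs generates pseudo targets
# for non-parallel source utterances.

def fit_linear(pairs):
    """Fit y ~ a*x + b by least squares on scalar (x, y) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

small_parallel = [(1.0, 2.1), (2.0, 4.0), (3.0, 6.1)]  # scarce real pairs
nonparallel_src = [4.0, 5.0, 6.0]                      # plentiful sources

a, b = fit_linear(small_parallel)                      # seed model
pseudo_pairs = [(x, a * x + b) for x in nonparallel_src]  # pseudo targets
augmented = small_parallel + pseudo_pairs              # enlarged train set
```

The augmented set then trains the final conversion model; in the thesis's setting the "pairs" would be aligned utterances rather than scalars, but the data-flow is the same.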
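The final contribution's adversarial post-processing can be sketched with a single FGSM-style step, a standard adversarial-example technique used here as a stand-in for the thesis's actual method. The linear "detector" and all numeric values are toy assumptions: the converted signal is nudged against the detector's score gradient under a per-sample perturbation budget.

```python
# Hedged sketch of FGSM-style post-processing against a spoofing detector.
# Toy linear detector: higher score means "more likely spoofed".

def detector_score(x, w):
    """Toy spoofing detector: score = w . x (flag as spoof if positive)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def adversarial_post_process(x, w, eps=0.1):
    """One FGSM step: the gradient of w . x w.r.t. x is just w, so move
    each sample against it, bounded by eps per sample (L-infinity budget)."""
    return [xi - eps * sign(wi) for xi, wi in zip(x, w)]

w = [0.5, -0.3, 0.2]          # toy detector weights
converted = [0.4, 0.1, -0.2]  # toy converted-speech samples
adv = adversarial_post_process(converted, w, eps=0.1)
```

With a neural detector the gradient would come from backpropagation rather than being read off the weights, and the perturbation must stay small enough to leave the perceived speech quality intact.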
Keywords/Search Tags:speech signal processing, voice conversion, deep learning, non-parallel data, low-latency, adversarial learning