Investigation On Deep Learning Based Voice Conversion

Posted on: 2019-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: J H Lai
Full Text: PDF
GTID: 2428330590992284
Subject: Computer technology
Abstract/Summary:
Voice conversion is a speech processing technique that converts an original utterance into speech with a different style. It has many applications. The most obvious is generating a speech database for text-to-speech (TTS) from limited data. Voice conversion also plays an important role in speech restoration, speech translation and some security-related applications. Speaker conversion is the most important task in voice conversion and is the main research topic of this thesis. Voice conversion falls into two categories depending on the content of the database: parallel and non-parallel. With parallel data, the database contains speech from the source and target speakers with the same content. With non-parallel data, the database contains only speech with different content, or with a small portion of shared content.

The thesis first proposes a phone-aware voice conversion framework based on neural networks. Automatic speech recognition (ASR) is used to extract the phoneme alignment of the speech. Voice activity detection is used to obtain more precise speech boundaries for the extracted phonemes. With the help of this phoneme information, an improved dynamic time warping (DTW) algorithm produces a more accurate frame-level alignment. Finally, LSTM-RNNs are used as the conversion model, converting spectral features with the help of phoneme features. The evaluation shows that the proposed phone-aware LSTM-RNN system performs significantly better than the baseline LSTM-RNNs in both objective and subjective evaluations.

The thesis also proposes a dual learning based voice conversion framework that does not require a large amount of parallel training data. A small amount of parallel data is used to train initial voice conversion models. The dual learning mechanism then trains the spectral conversion models from speaker A to speaker B and from speaker B to speaker A simultaneously. A spoofing detection model serves as supervision to keep the intermediate spectral features free from distortion. The experiment shows that the dual learning framework improves the initial conversion models in subjective evaluation, while the objective evaluation results do not deviate from the normal range. This indicates that dual learning, supervised by spoofing detection, can make effective use of non-parallel data to improve the conversion model.
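
The abstract does not say how the DTW algorithm is modified, so the following is only a minimal sketch of one plausible way phoneme information could guide frame-level alignment: a standard DTW whose local cost adds a fixed penalty when the phoneme labels of a source frame and a target frame disagree. The function name `phone_aware_dtw`, the `mismatch_penalty` parameter and the additive-penalty form are all assumptions made for illustration, not the method from the thesis.

```python
import numpy as np

def phone_aware_dtw(src, tgt, src_ph, tgt_ph, mismatch_penalty=10.0):
    """Align src frames (n, d) to tgt frames (m, d).

    src_ph / tgt_ph hold one phoneme label per frame.
    Returns the warping path as a list of (src_index, tgt_index) pairs.
    Hypothetical sketch: the penalty form is an assumption, not the thesis method.
    """
    n, m = len(src), len(tgt)
    # Local cost: Euclidean distance between every source/target frame pair.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    # Assumed phone-aware term: extra cost where the phoneme labels differ.
    mismatch = np.asarray(src_ph)[:, None] != np.asarray(tgt_ph)[None, :]
    dist = dist + mismatch_penalty * mismatch

    # Accumulated cost with the usual three DTW steps.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the last cell to recover the frame-level alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(moves, key=lambda t: t[0])
    path.reverse()
    return path
```

In this sketch the per-frame phoneme labels would come from an ASR alignment such as the one the abstract describes, and the aligned frame pairs would then serve as training pairs for the conversion model. A larger `mismatch_penalty` pushes the path harder toward pairs that share a phoneme.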
Keywords/Search Tags:Voice Conversion, Neural Network, Dynamic Time Warping, Dual Learning, Spoofing Detection