| As an important research branch of speech processing, voice conversion (VC) aims to convert speaker identity while keeping the semantic content of the source speech unchanged. Benefiting from the development of deep learning, VC has made substantial progress from parallel to non-parallel conversion. Neural-network-based research and innovation have become the main trend in the field, and methods based on generative adversarial networks (GANs) have become a research focus. Based on the non-parallel StarGAN-VC baseline model, this thesis studies how to improve speech quality and personality similarity, and proposes a series of methods to improve model performance. First, to improve on the baseline and generate higher-quality converted speech, this thesis integrates the ESR network into the generator of StarGAN-VC. The ESR network extracts deep features of the input speech spectrum, which are then fused with the shallow features extracted by the generator's encoder network, thereby improving the quality of the converted speech. To further process the fused spectral features output by the ESR network and the generator's encoder network, this thesis adds DSNet to the generator of the improved model and proposes DS-ESR-StarGAN-VC. With its dense weighted normalized shortcut structure, DSNet enables the model to achieve better performance without a significant increase in training time. Subjective and objective evaluations show that, compared with the baseline model, the proposed method increases the average mean opinion score (MOS) by 22.25% and the ABX score by 4.73%, and decreases the mel-cepstral distortion (MCD) by 4.38%, indicating that the quality of the converted
speech is improved, as is its personality similarity. Second, to improve the personality similarity of the converted speech more effectively and to broaden the model's applicability, this thesis proposes DS-ESR-StyleGAN-VC. The proposed model removes the generator's one-hot label and instead uses a style encoder to extract speaker style features, which are then embedded into the generator's decoder network via adaptive instance normalization. This thesis also optimizes the training loss function so that the generator learns to convert speaker style features during training, improving the personality similarity of the converted speech. Subjective and objective evaluations show that, in the closed-set case and compared with the baseline model, the average MOS increases by 21.28%, the ABX score increases by 8.16%, and the MCD decreases by 7.36%, indicating that the proposed method improves the personality similarity of the converted speech while maintaining its quality. The proposed method also performs voice conversion in the open-set case without degrading generation quality, extending its scope of application from the closed set to the open set. In summary, by combining DSNet and the ESR network in the baseline model, the proposed method improves model performance as well as the quality and personality similarity of the converted speech. Furthermore, the proposed StyleGAN-VC combining DSNet and the ESR network extracts speaker style features through the style encoder and embeds them into the generator's decoder network, which improves personality similarity, extends the model from the closed set to the open set, and provides an important theoretical discussion for the practical application
of voice conversion technology. |
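
The abstract states that speaker style features are embedded into the generator's decoder through adaptive instance normalization (AdaIN). The following is a minimal sketch of that operation only, not the thesis's implementation: each channel of a feature map is normalized over time and then rescaled and shifted by per-channel parameters. The function names, the toy feature values, and the assumption that the scale (`gamma`) and shift (`beta`) are predicted from the style encoder's output are all illustrative.

```python
from statistics import fmean, pstdev


def adain(channel, gamma, beta, eps=1e-5):
    """AdaIN for one channel: normalize over time, then scale and shift.

    gamma and beta are assumed to be derived from a speaker style vector;
    how they are produced is not specified in the abstract.
    """
    mu = fmean(channel)
    sigma = pstdev(channel)  # population standard deviation over time
    return [gamma * (x - mu) / (sigma + eps) + beta for x in channel]


def adain_feature_map(features, gammas, betas):
    """Apply AdaIN independently to every channel of a feature map."""
    return [adain(ch, g, b) for ch, g, b in zip(features, gammas, betas)]


# Toy feature map: 2 channels x 4 time frames (hypothetical values).
features = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 9.5, 10.0]]
gammas = [2.0, 0.5]   # hypothetical per-channel scales from the style vector
betas = [1.0, -1.0]   # hypothetical per-channel shifts from the style vector

styled = adain_feature_map(features, gammas, betas)
```

After the call, each output channel has approximately the mean `beta` and standard deviation `gamma` chosen for it, regardless of the statistics of the input channel. This is why the operation can impose a target speaker's style on decoder features without a one-hot label.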