| A voice signal is the signal a person produces when speaking; it carries the speaker's personality characteristics, linguistic information, and emotional information. Voice conversion transforms a source speaker's voice so that it sounds like the target speaker while keeping the semantic content unchanged. Voice conversion can be divided into parallel-corpus and non-parallel-corpus voice conversion. Parallel data means that the utterances of the source and target speakers are similar in pronunciation duration, linguistic content, and emotional prosody. In practical scenarios, however, collecting a parallel corpus is time-consuming and labor-intensive, and in fields such as cross-lingual conversion and medical assistance a parallel corpus cannot be obtained at all. Moreover, even when parallel data is collected, most voice conversion methods still need to align the training data. The alignment process inevitably introduces errors and requires additional complicated steps, such as careful corpus preprocessing or manual correction, to resolve misalignment. Because of these limitations of parallel voice conversion in practice, non-parallel voice conversion has become both a hotspot and a challenge in current research. Among non-parallel methods, voice conversion based on the Star Generative Adversarial Network (StarGAN) provides a non-parallel many-to-many voice conversion framework. Building on this framework, this paper proposes a many-to-many voice conversion method based on DenseNet StarGAN to improve the performance of voice conversion.
On the one hand, this paper proposes a voice conversion method based on StarGAN combined with the i-vector. To improve the speaker similarity of the converted speech, the i-vector, a feature commonly used in speaker recognition that can better represent the speaker's personality, is incorporated into the StarGAN model. Subjective and objective evaluation results show that, compared with the baseline system, the proposed method reduces the average MCD of the converted speech by 3.25% and improves the average MOS by 8.02% and the average ABX score by 5.25%. This shows that the proposed method clearly improves speaker similarity while also improving speech quality. On the other hand, this paper introduces DenseNet into the StarGAN model to achieve better quality of the converted speech. DenseNet helps address the degradation problem and improves the efficiency of gradient back-propagation during training. In this way, the generator's ability to extract linguistic information in the encoding stage is strengthened, which improves the quality of the converted speech. In addition, this paper replaces the Rectified Linear Unit (ReLU) with the Gaussian Error Linear Unit (GELU) as the model's activation function, to further mitigate the vanishing-gradient problem and accelerate convergence. Together, these two improvements raise the quality of the converted speech. Subjective and objective evaluation results show that, compared with the baseline system, the proposed method reduces the average MCD of the converted speech by 7.72% and improves the average MOS by 15.24% and the average ABX score by 6.55%. This demonstrates that the proposed method greatly improves speech quality while also improving speaker similarity... |
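
The abstract's motivation for replacing ReLU with GELU can be illustrated with a small numerical sketch. The code below is not the paper's implementation; it assumes the exact (erf-based) form of GELU, and the helper names `relu_grad` and `gelu_grad` are hypothetical. It shows only that GELU keeps a nonzero gradient for moderately negative inputs, where ReLU's gradient is exactly zero.

```python
import math


def relu(x: float) -> float:
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu_grad(x: float) -> float:
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0


def gelu_grad(x: float) -> float:
    """Derivative of x * Phi(x), i.e. Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


if __name__ == "__main__":
    # Compare the two gradients on a few illustrative inputs.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu'={relu_grad(x):.3f}  gelu'={gelu_grad(x):.3f}")
```

For negative pre-activations a ReLU unit passes no gradient back, whereas the GELU gradient is small but nonzero, which is one common explanation for why such a substitution can ease training. How much it helps in the StarGAN generator described here is an empirical matter reported by the paper's own evaluation.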