Voice style transfer is a broad term: techniques described in the literature as "voice cloning", "multi-speaker style migration", "style migration", "voice conversion", and so on can all be called voice style transfer. This paper focuses on the conversion of speech features within voice style transfer, referred to as voice conversion, which changes the voice of a source speaker into the characteristic voice of a target speaker. The aim of this research is to give a human voice style to a robot, so that the robot speaks with human voice characteristics, or with the voice characteristics of a particular person, creating a machine voice with a specific style.

With the increasing use of intelligent voice technology in the medical field, supported by big data and artificial intelligence, voice technology has gradually evolved from early speech recognition to intelligent voice assistance and has achieved many meaningful results in medicine, for example in medical accompanying robots and auxiliary treatment robots. In the treatment of children with autism, personalized voices are effective for early intervention, so voice style transfer can be used to generate personalized voices for accompanying robots and assistive treatment robots.

Traditional voice style transfer uses the mean square error (MSE) as the loss function, which over-smooths the synthesized speech and yields sub-optimal perceptual quality, so the synthesized signal has low naturalness. Newer methods require building text and duration models; although they improve the naturalness of synthesized speech, they still find it difficult to generate personalized voices, and their modeling consumes considerable computing resources.

To improve on these problems, this paper applies the idea of Generative Adversarial Networks (GANs) to voice style transfer. The adversarial loss compensates the mean square error, mitigating over-smoothing and sub-optimal perception while reducing model complexity. A GAN-based voice style transfer model is established and verified experimentally on the CMU ARCTIC voice dataset and the Tsinghua University Chinese speech dataset THCHS-30, for male-to-female and female-to-female voice conversion respectively.

The innovations of this paper are: (1) a Generative Adversarial Network model for voice style transfer (VSTGAN) is proposed and established; (2) to avoid the difficulty of training general GAN models, the generator in VSTGAN is designed as a highway network, which makes the VSTGAN model effective to train, as sketched below.
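To make these two ideas concrete, the following minimal sketch (written in PyTorch, which the paper does not specify as its framework) illustrates a highway-network generator and a generator loss in which an adversarial term compensates the mean square error. The feature dimension, layer widths, layer counts, and the weight lambda_adv are illustrative assumptions, not the settings used in the experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HighwayLayer(nn.Module):
    """Highway layer: y = T(x) * H(x) + (1 - T(x)) * x.
    The transform gate T keeps an identity path open, which is what
    eases gradient flow and makes the generator easier to train."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        # Bias the gate toward carrying the input early in training.
        nn.init.constant_(self.gate.bias, -1.0)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))   # transform gate T(x)
        h = F.relu(self.transform(x))     # candidate transform H(x)
        return t * h + (1.0 - t) * x      # highway combination

class Generator(nn.Module):
    """Generator as a stack of highway layers mapping source-speaker
    spectral feature frames to target-speaker feature frames."""
    def __init__(self, dim=24, n_layers=4):
        super().__init__()
        self.layers = nn.Sequential(*[HighwayLayer(dim) for _ in range(n_layers)])

    def forward(self, x):
        return self.layers(x)

class Discriminator(nn.Module):
    """Scores whether a feature frame sounds like the target speaker."""
    def __init__(self, dim=24):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.net(x)

def generator_loss(d_fake, converted, target, lambda_adv=0.1):
    """MSE reconstruction loss compensated by an adversarial term,
    countering the over-smoothing that MSE alone produces."""
    mse = F.mse_loss(converted, target)
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return mse + lambda_adv * adv

# Hypothetical usage on a batch of 24-dimensional spectral frames:
#   G, D = Generator(), Discriminator()
#   fake = G(source_frames)
#   loss = generator_loss(D(fake), fake, target_frames)

The design choice follows the paper's stated motivation: the MSE term anchors the converted frames to the target speaker's features, while the adversarial term penalizes the over-smoothed, "averaged" outputs that MSE alone tends to produce.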
The voice style transfer realized in this paper is evaluated subjectively and objectively against the MOS standard established by the international ITU organization, with voices generated by a Deep Belief Network, a Long Short-Term Memory network, a highway network, and WaveGAN as control groups for feature comparison. The results show that, compared with these four voice conversion methods, the proposed method not only scores higher on the subjective MOS of the generated voice, but also performs well on the Mel-spectrum fidelity and spectral fidelity parameters, and achieves a better score on the ABX blind listening test.

On this basis, this paper proposes a way to map the traditional synthetic mechanical voice to the parent's voice, which is used to replace the mechanical voice of the rehabilitation robot and the accompanying robot. In this way, the paper realizes, both theoretically and technically, the combination of personalized voice conversion with generative adversarial networks in deep learning, with improved naturalness and intelligibility. Finally, a brief analysis of the ideas presented in this paper is given; they are expected to play an active role in the adjuvant treatment of autistic patients.