| Voice conversion (VC) is an intelligent speech technology that converts speaker identity while keeping the content of the source voice unchanged. As an important branch of voice conversion, singing voice conversion has many important applications in multimedia entertainment and voice interaction systems. With the development of artificial intelligence and neural network technology, singing voice conversion is also progressing rapidly, and various classical conversion models have achieved good conversion performance. In practical applications, a mature singing voice conversion technique must not only convert identity between different singing voices well, but also achieve good conversion performance in the open-set case, given only the normal speech of the target. In addition, the operational efficiency of the model directly affects the storage and computational resources required in practice. This paper therefore investigates two aspects, improving singing voice conversion performance and improving model efficiency, and proposes a series of improvements. First, to realize singing voice conversion effectively and broaden its application, this paper proposes the Style GAN singing voice conversion model, which extracts the identity information of the target singer through a style encoder and achieves good singing voice conversion performance. Further, this paper introduces the CBAM attention mechanism to improve the generator and proposes the C-Style GAN singing voice conversion model, which improves the generative and expressive ability of the model without increasing the depth or width of the network, enhances the extraction of details in the singing voice spectrum, and effectively improves the quality of the converted singing voice. Subjective and objective experimental results show that, compared with the Star GAN model, the C-Style GAN model proposed in this paper improves the 
average MOS by 36.18%, improves the ABX by 16.55%, and reduces the MCD of the reconstructed singing voice by 13.60%, effectively improving the conversion quality of the singing voice. At the same time, the model can complete the conversion with only the normal speech of the target in the open-set case, which removes the dependence on the target's singing voice and broadens its range of application. Second, to improve model efficiency, this paper introduces dynamic channel fusion to improve the dynamic convolution in the generator and proposes the DC-Style GAN singing voice conversion model. By rethinking dynamic convolution from the perspective of matrix decomposition, dynamic channel fusion achieves a significant dimensionality reduction in the latent space and alleviates the difficulty of jointly optimizing the dynamic attention and the static convolution kernels, thus improving operational efficiency. Subjective and objective experimental results show that, compared with the C-Style GAN model, the DC-Style GAN model has 66.87% fewer parameters and trains 34.09% faster, while the average MOS and average ABX values are basically unchanged. This shows that the optimization scheme substantially reduces the number of model parameters, accelerates training, and effectively improves operational efficiency, making the model more lightweight while leaving conversion performance basically unaffected. In summary, the DC-Style GAN singing voice conversion model proposed in this paper converts well and can complete the conversion given only the target's normal speech in the open-set case. The model also has high operational efficiency and low training cost, which provides an important theoretical discussion and simulation study for moving singing voice conversion technology toward practical application. |
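
The abstract reports a relative reduction in MCD (mel-cepstral distortion) but does not say how the metric is computed. As a rough illustration only, the sketch below uses the commonly cited definition, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. It assumes the two cepstral sequences are already time-aligned and that the energy coefficient has been dropped. The function names `mel_cepstral_distortion` and `relative_reduction` are hypothetical, and the paper's exact settings may differ.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB between two time-aligned sequences of mel-cepstral vectors.

    Assumption: both sequences have the same number of frames and the
    energy coefficient (c0) has already been removed from every vector.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and time-aligned")
    # Constant factor of the common MCD definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        squared_error = sum((r - c) ** 2 for r, c in zip(ref, conv))
        total += scale * math.sqrt(squared_error)
    return total / len(ref_frames)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline
```

Given per-model MCD values, a figure such as the reported 13.60% would correspond to `relative_reduction(baseline_mcd, proposed_mcd)`. The underlying MCD values are not given in the abstract, so none are assumed here.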