| Speech synthesis is a technology that converts text into speech and is widely used in a range of products. In recent years, market demand for Vietnamese speech synthesis has grown, but synthesized Vietnamese speech still suffers from poor naturalness, inadequate prosody, and slow synthesis speed. This paper addresses these problems through the following work. First, an improved autoregressive Vietnamese speech synthesis model based on NAT (Non-Attentive Tacotron) is proposed. Because the mel spectrograms generated by the NAT model are overly blurred, a flow-based network structure is used in the model's post-processing network to improve the naturalness and clarity of the synthesized speech. In the encoder module, dilated convolutions are stacked with ordinary convolutions to capture more context and enrich prosodic information. A comparison of the generated mel-spectrogram features shows that the improved NAT model outperforms the original NAT model, with an MCD value 0.53 lower. Its MOS score is 0.42 higher than that of the Tacotron2 model, 0.21 higher than that of the original NAT model, and within 0.32 of the real recordings, which confirms the advantage of the improved NAT model over the original. Second, an improved non-autoregressive synthesis model based on VITS is proposed. It targets two problems: the heavy computation in the VITS decoder, which slows synthesis, and the difficulty of finding the best alignment between text tokens and spectrogram frames. An iSTFT-based decoder replaces the upsampling structure, which greatly reduces the decoder's computational load, and a duration search algorithm is used to obtain the best alignment between the text and the spectrogram frames. The improved model's MCD value is 0.78 lower than that of the original VITS model, and its synthesis speed is significantly improved. Its MOS score is 0.21 higher than that of the original VITS model and within 0.06 of the real recordings, which further improves the quality of Vietnamese speech synthesis. Third, a prototype Vietnamese speech synthesis system was developed. Building on the research above, this paper designs and implements three modules (Vietnamese text normalization, long-sentence segmentation, and speech synthesis) that together provide the functions of the system. Functional and stress tests show that all three modules work as intended and can be used normally. The system synthesizes high-quality Vietnamese speech whose similarity to real human speech reaches 98%. |
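
The MCD (mel-cepstral distortion) values reported above are lower-is-better distances between synthesized and reference speech. The abstract does not state the exact formula used, so the following is only a minimal sketch of one commonly used definition, assuming the two mel-cepstral sequences are already time-aligned (for example by dynamic time warping) and that the 0th (energy) coefficient is excluded. The function name `mel_cepstral_distortion` and the parameter `skip_c0` are hypothetical and chosen for illustration.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames, skip_c0=True):
    """Average MCD in dB between two time-aligned mel-cepstral sequences.

    Assumption: each sequence is a list of frames, and each frame is a list
    of mel-cepstral coefficients of the same length. Alignment is not done
    here and is the caller's responsibility.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("sequences must be non-empty and time-aligned")

    start = 1 if skip_c0 else 0          # optionally drop the energy term c0
    scale = 10.0 / math.log(10.0)        # converts the distance to decibels

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[start:], syn[start:]))
        total += scale * math.sqrt(2.0 * squared)

    # Mean over frames gives a single utterance-level score.
    return total / len(ref_frames)
```

Under this definition, identical sequences give an MCD of 0, and a reported gap such as 0.53 or 0.78 would correspond to a difference between two such utterance-level averages computed against the same reference recordings.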