Research On Speech Synthesis Technology Based On Improved Mel Spectrogram

Posted on: 2024-04-03 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Luo | Full Text: PDF
GTID: 2568307124472034 | Subject: Computer technology
Abstract/Summary:
Speech synthesis, also known as text-to-speech (TTS), converts text into speech. It is a key technology for human-computer interaction, has a wide range of applications, and is an important research topic in artificial intelligence. With the rapid development of deep learning, speech synthesis has broken through earlier bottlenecks and can now produce speech close to the human voice. However, deep-learning-based speech synthesis models still require large-scale source speech data for training, and the Mel spectrogram features they use still rely on traditional extraction techniques. This thesis addresses these problems in three ways.

First, the traditional Mel spectrogram feature extraction algorithm is improved. Traditional extraction mainly applies the short-time Fourier transform (STFT) to the speech signal, but the STFT is better suited to stationary signals. For non-stationary speech signals, the continuous wavelet transform (CWT) has clear advantages. A CWT-based signal reconstruction algorithm is therefore proposed that transforms both high-frequency and low-frequency components accurately and captures more speech information, improving the Mel spectrogram feature extraction method and the accuracy of the Mel spectrogram features of speech samples.

Second, the encoder module of Tacotron2 is improved. The Tacotron2 encoder uses only the output of its last network layer as the linguistic feature sequence, ignoring information carried in the output sequences of the other layers. To retain as much of the input text content as possible, the output of each encoder layer is given a different weight and the weighted outputs are then combined through residual addition. The result is used as the input to the attention mechanism, which improves the alignment between the linguistic feature sequence and the Mel spectrogram feature sequence and therefore the accuracy of Mel spectrogram prediction.

Third, a TalkersGAN model is proposed based on CycleGAN-VC. To generate the voices of n speakers, CycleGAN-VC needs to train n^2 - n generative models, which is computationally expensive and highly redundant. TalkersGAN instead uses a single generative adversarial network to learn non-parallel mutual feature mappings among multiple speakers, reducing computation and model size and speeding up training. The model needs only a few minutes of a speaker's voice for training to synthesize high-quality speech.

Experiments show that the improved Mel spectrogram has a smaller RMSE than the traditional Mel spectrogram and performs better. The Mel spectrogram predicted by the improved Tacotron2 model is very close to that of the source speech and handles fine spectrogram details well. The TalkersGAN model uses small-scale speech data, shortens training time, and increases synthesis speed; its synthesized speech waveforms achieve a score of 3.85 in the MOS evaluation. TalkersGAN is also expressive when synthesizing the speech of different speakers.
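The encoder change described above (weighting every layer's output and adding the result residually before attention) can be illustrated with a minimal pure-Python sketch. The abstract does not give the exact formulation, so several details here are assumptions: the weights are softmax-normalized scores, the residual is taken against the last layer's output, and the names `softmax` and `fuse_encoder_layers` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # one encoder output: [time steps][feature dim]


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_encoder_layers(layer_outputs: List[Matrix],
                        layer_scores: List[float]) -> Matrix:
    """Weighted sum of all layer outputs, added residually to the last layer.

    Assumption (not stated in the abstract): one scalar score per layer,
    softmax-normalized, and the residual branch is the last layer's output.
    """
    if len(layer_outputs) != len(layer_scores):
        raise ValueError("need exactly one score per encoder layer")
    weights = softmax(layer_scores)
    last = layer_outputs[-1]
    fused: Matrix = []
    for t in range(len(last)):
        row = []
        for d in range(len(last[t])):
            mixed = sum(w * layer[t][d]
                        for w, layer in zip(weights, layer_outputs))
            row.append(last[t][d] + mixed)  # residual addition
        fused.append(row)
    return fused


if __name__ == "__main__":
    # Two toy "layers", one time step, two feature dimensions.
    layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
    print(fuse_encoder_layers(layers, [0.0, 0.0]))
```

In a trained model the per-layer scores would be learnable parameters; here they are fixed inputs so that the arithmetic stays easy to follow.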
Keywords/Search Tags: Speech Synthesis, Mel spectrogram, Tacotron2, TalkersGAN
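The n^2 - n figure quoted in the abstract for CycleGAN-VC follows from counting ordered (source, target) speaker pairs, each of which needs its own generator. The short sketch below checks that count by enumeration; the function names are hypothetical and the single-model count of 1 simply restates the abstract's claim about TalkersGAN.

```python
from itertools import permutations


def pairwise_generator_count(n_speakers: int) -> int:
    """Generators needed when every ordered speaker pair has its own model."""
    if n_speakers < 1:
        raise ValueError("n_speakers must be at least 1")
    return n_speakers * (n_speakers - 1)  # n^2 - n


def enumerate_speaker_pairs(n_speakers: int):
    """List every ordered (source, target) pair with source != target."""
    return list(permutations(range(n_speakers), 2))


if __name__ == "__main__":
    for n in (2, 4, 10):
        pairs = enumerate_speaker_pairs(n)
        # The closed form and the explicit enumeration must agree.
        assert len(pairs) == pairwise_generator_count(n)
        print(n, "speakers:", pairwise_generator_count(n),
              "pairwise generators vs. 1 shared model")
```

The quadratic growth is the redundancy the abstract refers to: going from 4 to 10 speakers raises the pairwise count from 12 to 90, while a single shared model stays at one.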