| Speech synthesis technology studies how to convert text into audible speech, and has a wide range of applications in intelligent voice assistants, intelligent navigation, smart speakers, virtual anchors, and other fields. The basic theory of advanced digital signal processing has laid a solid foundation for speech synthesis. In recent years, with the rapid development of deep learning, speech synthesis technology has made great progress, especially after the emergence of sequence-to-sequence networks with attention mechanisms. The quality of synthetic speech is now close to that of a real voice, and speech synthesis technology has reached the level of practical application. However, data sets with an insufficient amount of training data often limit the performance of speech synthesis systems in practice, and the long training time of the models seriously reduces their usability. To address these problems, this thesis designs and implements a speech synthesis system that can synthesize high-quality speech efficiently from a small data set. Based on an in-depth investigation, it proposes several improvements to existing speech synthesis systems. In addition, an online, visual implementation of the system is built using web technology. The research content of this thesis is divided into the following three parts. First, to achieve speech synthesis on a specific small data set, this thesis studies the network architecture of the Tacotron2 speech synthesis system and changes the tensor dimensions in the model so that the decoder decodes frames in groups, which accelerates the model. With this change, the accelerated model trains 2.8 times as fast as the original, and its generation speed is 1.4 times faster. Next, transfer learning is introduced to train the model on the specific small data set. Finally, a new mapping is introduced to preprocess the Mel spectrum, which strengthens the feature learning of the Tacotron2 model. With the Griffin-Lim vocoder, the mean opinion score of the reconstructed speech improves by approximately 0.8 points, which verifies that the new mapping significantly improves the performance of the acoustic model of the speech synthesis system. Second, building on this small-data speech synthesis implementation, this thesis introduces MelGAN, a neural-network-based vocoder, to further improve the quality of the synthetic speech. The MelGAN vocoder is trained with five different training methods to find its best performance, which is then compared with that of other vocoders. With other variables such as the acoustic model held constant, the MelGAN vocoder is about 15 times faster to train than the WaveGlow vocoder, 1.69 times faster at speech synthesis, and about 0.79 points higher in mean opinion score than the Griffin-Lim vocoder. Third, this thesis uses web technology to design an online speech synthesis system, which makes the system more convenient to use. Feedback from users makes it easier to optimize the algorithm, improve the model architecture, and refine the system. |