In recent years, speech synthesis technology has developed rapidly. Speech synthesized in a single language achieves a very high degree of intelligibility and naturalness. However, once the input text contains words from a language the model was not trained on, performance drops dramatically, and the audio synthesized for those words is close to noise. Code-switching refers to the phenomenon in which two or more languages or language variants are used within the same discourse, and the need for code-switching is widespread in speech synthesis. Using a single-language model to perform code-switching places higher demands on the training set, yet such annotated multi-language corpora are difficult to acquire. Likewise, for the voice cloning task, existing single-language models require training data of one person speaking multiple languages, which is also difficult to obtain in practice. The multi-language model used in this paper is based on Tacotron2: it adds a parameter generator, modifies the Tacotron2 encoder structure, and adds an adversarial speaker classifier. Using only monolingual corpora, it can clone a person's voice well and, at the same time, perform code-switching tasks well. The main work of this paper is as follows:

(1) Explore the silent clips in audio and propose a splicing-and-restoration scheme for audio frames. The multi-language experiment requires corpora from different languages to participate in training simultaneously. The Tibetan corpus used in this experiment was recorded by our laboratory while completing other tasks, and it differs considerably from the published speech synthesis corpus CSS10 (which covers Chinese, Spanish, Finnish, German, Hungarian, Dutch, French, Greek, Japanese and Russian) in sampling rate and audio duration. This paper first removed the silent segments from the speech in the Tibetan corpus and optimized the Voice Activity Detection (VAD) logic for splicing voice frames. Then the text length and corresponding audio duration of all samples used (from every language in the training set) were calculated, and potential problem samples were eliminated by deleting those whose audio duration deviated greatly from the mean.

(2) This paper verifies, at both the experimental and theoretical level, that letter sequences outperform phoneme sequences for Tibetan. Tacotron2 experiments show that when letter sequences are used as the text representation for Tibetan, the quality of the synthesized audio is better than with phoneme sequences.

(3) A multi-language model based on a parameter generator is used for synthesis. The multi-language model in this paper is built on the Tacotron2 framework, with a parameter generator module added to enable multi-language synthesis. In the encoder, a parameter generator and a locally shared encoder are used; knowledge sharing between languages improves the model's synthesis quality in the target language. In the multi-language experiment, 900 samples were selected for each of the 11 languages, and the experiment was then repeated with 600 samples per language. The results show that, given the same target-language corpus, the synthesis quality of the multi-language model far exceeds that of the single-language model.

(4) In the multi-language model, an adversarial speaker classifier based on domain adaptation theory is used to complete the voice cloning and code-switching tasks. For voice cloning, this paper adds an adversarial speaker classifier on top of the multi-language model; adversarial training forces the model to learn speaker-independent features and thereby promotes knowledge transfer between languages. This paper also expands the dataset, adding the high-speech-quality portion of the Common Voice dataset to the original data. The evaluation of the synthesized audio shows that the model used in this paper achieves good results on both tasks.
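The duration-based sample filtering described in (1) can be sketched as follows. This is a minimal illustration, not the thesis's exact criterion: the helper name `filter_duration_outliers` and the threshold of 1.5 standard deviations from the mean are assumptions.

```python
import statistics

def filter_duration_outliers(samples, max_sigma=1.5):
    """Drop samples whose audio duration deviates too far from the mean.

    `samples` is a list of (text, duration_seconds) pairs. The cutoff of
    `max_sigma` standard deviations is an assumed threshold; the thesis
    only states that durations far from the mean were removed.
    """
    durations = [dur for _, dur in samples]
    mean = statistics.mean(durations)
    stdev = statistics.pstdev(durations)
    return [
        (text, dur)
        for text, dur in samples
        if stdev == 0 or abs(dur - mean) <= max_sigma * stdev
    ]

# Toy corpus: four ~3 s clips and one 30 s recording, which is removed.
corpus = [("a", 3.1), ("b", 2.9), ("c", 3.3), ("d", 30.0), ("e", 3.0)]
kept = filter_duration_outliers(corpus)
```

In practice such a check would run per language, since mean utterance duration differs across the corpora in the training set.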