Speech synthesis is a cutting-edge technology in the field of information processing. In recent years, China's Tibetan information technology has developed by leaps and bounds, which has played a positive role in the economic and social development of Tibet. However, there is still much room for progress in research on Tibetan speech synthesis. This is mainly due to the lack of in-depth research on Tibetan speech. Limited also by scarce resources, speech synthesis systems that achieve practical results are rare in this field, and the existing research results remain at the experimental stage. Therefore, studying the key technologies of Tibetan speech synthesis in depth, resolving them early, and forming an overall solution are currently vital research topics in information processing. This will help promote the development and prosperity of Tibetan culture, expand its international influence, strengthen the self-development capacity of information technology in China's Tibetan areas, and accelerate the integration of the Tibetan language with modernization. This paper studies Tibetan speech synthesis for the U-Tibet dialect. It first adopts the traditional "parameter synthesis" method based on the HMM model and analyzes the technologies involved in front-end text analysis for Tibetan speech synthesis. These include Tibetan phoneme analysis, Latin transliteration, segment labeling and prosody labeling, Tibetan pronunciation rules, Tibetan polysyllabic word analysis, special symbol processing, Tibetan automatic word segmentation, part-of-speech tagging, and other key problems of the front-end language model. The front-end text analysis ultimately generates a set of prosodic texts that provide the necessary information for the back-end acoustic model. Considering that the speech synthesized by the 
traditional HMM-based "parameter synthesis" method is unnatural, has poor timbre, and is insufficiently intelligible, this paper then introduces the "end-to-end synthesis" method that is currently most popular in the industry, namely the deep-learning-based Tacotron model. This paper studies a Tibetan text-to-speech conversion model built on an Encoder-Decoder structure with an attention mechanism. Drawing on the model architectures used for mainstream languages, it implements Tibetan speech conversion that takes characters as input and outputs a spectrogram. Finally, the synthesis quality of the "end-to-end synthesis" deep learning model is evaluated objectively and subjectively through experiments. Comparison with the MOS scores of HMM-based statistical parameter synthesis clearly shows that Tacotron synthesis outperforms parameter synthesis. Across the evaluation criteria of timbre, naturalness, and intelligibility, the MOS score of "end-to-end synthesis" is higher than that of "parameter synthesis" ("end-to-end synthesis" scores 4.73 points (or 4.61 points), versus 3.96 points for "parameter synthesis"). In addition, a detailed analysis of the mainstream "end-to-end synthesis" Tacotron model shows that after 25,000 training iterations, both the alignment produced by the attention mechanism and the synthesized speech spectrogram are of good quality. Whether the synthesized speech is assessed as a whole or separately in terms of timbre, naturalness, and intelligibility, U-Tibet dialect speech synthesized with the Tacotron model scores higher than speech synthesized with statistical parameters. Therefore, the "end-to-end" synthesis method has research and application value in Tibetan 
speech synthesis.
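
As a minimal sketch of how a MOS comparison of this kind might be computed, the Python example below averages listener ratings on a 1-5 scale and attaches an approximate 95% confidence interval. The function name `mean_opinion_score`, the normal-approximation interval, and the ratings themselves are assumptions made for illustration; they are not the evaluation procedure or the data behind the scores reported above.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` holds listener scores on the usual 1-5 opinion scale.
    The interval uses a normal approximation (1.96 * standard error),
    which is an assumption of this sketch.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1-5")
    mos = mean(ratings)
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0  # a single rating carries no spread information
    return mos, half_width


# Hypothetical ratings, for illustration only (not the paper's data).
hypothetical_ratings = {
    "end_to_end": [5, 4, 5, 5, 4],
    "parametric": [4, 4, 3, 4, 4],
}

for system, ratings in hypothetical_ratings.items():
    mos, ci = mean_opinion_score(ratings)
    print(f"{system}: MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting an interval alongside each mean makes it easier to judge whether a gap between two systems is larger than the variation among listeners.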