Human-computer voice interaction is a technology that enables people to talk to computers, and speech synthesis, the conversion of text to speech, is a typical human-computer voice interaction technology [1]. In recent years, speech synthesis has made remarkable progress in both academia and industry, achieving strong results in intelligibility, sound quality, and listening comfort, but its naturalness has not yet reached the level of the human voice. With the growing demand for intelligent living, voice systems such as voice control, speech translation, and voice navigation are continuously being put into use, and the demand for the underlying technology keeps growing, shaping the direction of speech synthesis research. It is therefore imperative to explore new ways to improve the naturalness and expressiveness of synthesized speech.

This paper studies Uyghur speech synthesis technology. It first reviews the research status of Uyghur speech synthesis systems and then carries out the following work on the existing problems. Because no public Uyghur speech synthesis corpus is available, this study builds the required Uyghur text corpus and speech database. Based on the grammatical structure, prosodic hierarchy, and phonetic features of the Uyghur language, the front-end text processing of the Uyghur text-to-speech system is studied, and the mapping from the linguistic level to the acoustic level is clarified through text processing. To improve the robustness of the text analysis module, a knowledge base, a rule base, and a treebank are constructed. Taking the prosodic phrase as the basic unit, the method of part-of-speech adjustment and the segmentation of prosodic words and prosodic phrases are studied.

To realize a Uyghur speech synthesis system based on the Hidden Markov Model (HMM) and construct its framework, this study compiles the context attribute set and question set of all possible phonemes according to the features of the Uyghur language, then optimizes them and verifies them experimentally. The prosody of prosodic phrases is adjusted through an improved training procedure and a phoneme duration model, thereby improving the naturalness of the synthesized speech.

Although the naturalness of the HMM-based Uyghur speech synthesis system reaches the application standard, a gap with practical requirements remains. To obtain a more natural synthesis system, this study explores neural network methods. By studying the linguistic features used as network input and the acoustic features produced as output, a training framework is constructed and compared across different neural network models. Subjective and objective tests are used to evaluate the quality of the synthesized speech. The results show that the Uyghur speech synthesis system based on BiLSTM (bidirectional long short-term memory) networks is superior to the HMM-based parametric synthesis system in terms of continuity and fluency, and its naturalness exceeds the application standard and reaches a satisfactory level.

On the basis of this highly natural synthesis system, and in order to further improve the emotional expressiveness of the synthesized speech, texts with emotional features are studied: the input text is classified by emotion in units of prosodic phrases, yielding emotional linguistic features for the text. Finally, the BiLSTM-based Uyghur speech synthesis system is successfully applied to a Uyghur-Chinese speech translation system consisting of three major modules, speech recognition, machine translation, and speech synthesis, which improves the naturalness of the system's synthesized speech. In addition to improving the naturalness of the synthesized speech in the Uyghur translation system, this study realizes application value in the field of Uyghur speech synthesis and can also serve as a reference for speech synthesis research and applications in related languages such as Kazakh and Kyrgyz.
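To make the BiLSTM acoustic modeling step concrete, the sketch below shows one common way such a model can be set up: a bidirectional LSTM reads frame-level linguistic feature vectors and regresses frame-level acoustic parameters. This is a minimal illustrative sketch only; the class name, feature dimensions, layer sizes, and loss are assumptions for demonstration and are not the configuration used in this study.

```python
# Minimal sketch of a BiLSTM acoustic model for parametric speech synthesis.
# All dimensions and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=300, acoustic_dim=187, hidden=256, layers=2):
        super().__init__()
        # Bidirectional LSTM reads the frame-level linguistic feature sequence
        # in both directions, so each frame sees left and right context.
        self.rnn = nn.LSTM(linguistic_dim, hidden, num_layers=layers,
                           batch_first=True, bidirectional=True)
        # Linear layer maps hidden states to acoustic parameters
        # (e.g., spectral features, log-F0, aperiodicity, voicing flag).
        self.out = nn.Linear(2 * hidden, acoustic_dim)

    def forward(self, linguistic_feats):
        # linguistic_feats: (batch, frames, linguistic_dim)
        hidden_states, _ = self.rnn(linguistic_feats)
        return self.out(hidden_states)  # (batch, frames, acoustic_dim)

if __name__ == "__main__":
    model = BiLSTMAcousticModel()
    criterion = nn.MSELoss()  # regression to acoustic targets
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Dummy batch standing in for aligned (linguistic, acoustic) frame pairs.
    x = torch.randn(4, 500, 300)
    y = torch.randn(4, 500, 187)
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    print(float(loss))
```

In practice, the predicted acoustic parameter trajectories would be passed to a vocoder to generate the waveform; the choice of vocoder and of the exact linguistic and acoustic feature sets is outside the scope of this sketch.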