
Research On Mandarin-to-Tibetan Cross Lingual Speech Conversion Based On Deep Neural Network

Posted on: 2019-06-22  Degree: Master  Type: Thesis
Country: China  Candidate: W B Ruan  Full Text: PDF
GTID: 2428330545983978  Subject: Circuits and Systems
Abstract/Summary:
Cross-lingual speech conversion is a major subject in artificial intelligence research. It analyzes the voice of the source speaker and uses speech conversion technology to produce the desired speech, which keeps the voice quality characteristics of the source speaker. Tibetans are one of the ethnic minorities with a large population in China. Using deep learning to study Mandarin-to-Tibetan conversion will promote communication between the Tibetan and Han peoples and help protect the rich Tibetan culture. Aiming at speech conversion from Mandarin to the Lhasa dialect of Tibetan, this thesis proposes a method that combines speech recognition and speech synthesis. Based on this method, a cross-lingual speech conversion system based on a deep neural network (DNN) is implemented. The sound quality of the synthesized Tibetan speech and of the converted speech is evaluated separately, both subjectively and objectively. The main work and innovations of this thesis are as follows.

A DNN-based Mandarin speech recognition method is studied. The study finds that part of a trained DNN can be taken to extract features, and that these new features perform better than Mel-frequency cepstral coefficient (MFCC) features in speech recognition. Firstly, the pre-training, parameter adjustment and model optimization of the DNN model are studied, and a DNN model for speech acoustic feature extraction is built on the Kaldi platform. Secondly, MFCC features are used as the basis for extracting deep speech features with stronger transformability and discriminability, and these new features are used to train and implement a speech recognition system based on the DNN-HMM acoustic model. Finally, in the best result, the monophone error rate and word error rate of the DNN-HMM model are 19.62% and 27.12% lower, respectively, than those of the hidden Markov model (HMM).

A Mandarin-to-Tibetan speech conversion system is implemented. Firstly, a corpus of 800 Tibetan sentences is selected as the training corpus, and HMMs of the spectral parameters, duration and fundamental frequency are trained on this corpus using the expectation-maximization (EM) algorithm under the maximum likelihood (ML) criterion. Then, a context decision-tree clustering algorithm is applied to cluster the models, yielding the prediction model for synthetic speech. Context-sensitive HMMs are obtained by combining the context-sensitive labels with the prediction models. Finally, the Tibetan speech is synthesized with the parametric speech synthesizer STRAIGHT. Given Mandarin input, the average accuracy of the semantic expression of the Tibetan output is evaluated as part of the quality evaluation of the converted speech. The average accuracy for single words, words and sentences is 65.40%, 82.15% and 98.15%, respectively.
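
The word error rate quoted above is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The abstract does not say which scoring tool or settings were used, so the following is only a minimal sketch of that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not the thesis's implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace. Mandarin text would
    need to be segmented into words (or characters) before this step.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

A phone error rate can be computed the same way by passing phone sequences instead of word sequences.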
Keywords/Search Tags:deep neural network, hidden markov model, Mandarin speech recognition, Tibetan speech synthesis, cross lingual speech conversion