
Research On Tibetan Speech Recognition Based On Speech Spectral Features

Posted on: 2022-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Wang
Full Text: PDF
GTID: 2518306500956979
Subject: Intelligent information processing
Abstract/Summary:
Automatic speech recognition converts speech sequences into text sequences and is a key technology for human-computer interaction. Speech recognition research and technology for mainstream languages such as English, Mandarin and Japanese are now mature. Tibetan speech recognition, however, lags behind and has few practical applications, because Tibetan is a minority language without a large-scale corpus and the foundation of linguistic research is weak. To address these problems, this thesis designs and builds a Tibetan pronunciation dictionary, a Tibetan speech corpus and a language model, focusing on the extraction of acoustic features from Tibetan speech and the construction of a speech recognition model. The main contributions of this thesis are as follows.

Firstly, a Tibetan pronunciation dictionary, a speech recognition corpus and a language model were built. Tibetan linguistic knowledge and syllabic features were analyzed, and the consonants and vowels of Tibetan were used as recognition units to construct the U-Tsang corpus, together with a Tibetan pronunciation dictionary and a Tibetan language model. The corpus contains 18,000 utterances with a total duration of 11.26 hours, recorded by 20 speakers (8 male and 12 female). The pronunciation dictionary contains 16,398 words.

Secondly, Tibetan speech recognition models with a hybrid architecture were established. Mel-Frequency Cepstral Coefficients (MFCC) were extracted from the speech and used to build Tibetan speech recognition models based on the Hidden Markov Model (HMM), Deep Neural Network (DNN), Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), which were then validated experimentally. Under the same experimental conditions, the word error rates of the four models, GMM-HMM, DNN-HMM, CNN-HMM and LSTM-HMM, are 35.58%, 33.38%, 31.61% and 25.35%, respectively. These models serve as baselines for comparison with the end-to-end Tibetan speech recognition model described next.

Finally, an end-to-end Tibetan speech recognition model based on spectral features was built, and its recognition rate and generalization performance were improved by data augmentation. Speech is transformed into a spectrogram feature by fast Fourier transform, and this feature is used to train the end-to-end model. The word error rate of this model is 34.72%, which is better than that of the GMM-HMM model under the same experimental conditions. The thesis addresses the low recognition rate of end-to-end models on a small corpus by augmenting the data with added noise. Experiments show that the word error rate of the model is reduced by 6.19% after data augmentation. The augmented model outperforms the DNN-HMM and CNN-HMM models and generalizes better in natural environments. The results also improve on those of previous studies in the laboratory.
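The two signal-processing steps named in the abstract, noise-based data augmentation and FFT-based spectrogram extraction, can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the white Gaussian noise model, the 10 dB signal-to-noise ratio, the 16 kHz sample rate, the 25 ms frame and 10 ms hop, and the function names `add_noise` and `spectrogram` are all assumptions chosen for illustration, and a synthetic tone stands in for a real recording.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Mix white Gaussian noise into `signal` at the requested SNR in dB.

    Assumption: the abstract only says noise was added; the noise type
    and level used here are illustrative.
    """
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Log-magnitude spectrogram from a short-time FFT with a Hann window.

    Defaults correspond to 25 ms frames and a 10 ms hop at 16 kHz
    (assumed values, not taken from the thesis).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: n_fft // 2 + 1 bins per frame.
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(magnitude + 1e-10)  # small offset avoids log(0)


# One second of a synthetic 440 Hz tone stands in for a speech recording.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 440 * t)

noisy = add_noise(clean, snr_db=10, rng=rng)  # augmented copy
clean_spec = spectrogram(clean)               # shape: (frames, frequency bins)
noisy_spec = spectrogram(noisy)
print(clean_spec.shape, noisy_spec.shape)
```

In a training pipeline along these lines, both the clean and the noisy spectrograms would be paired with the same transcript, so the model sees each utterance under more than one acoustic condition.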
Keywords/Search Tags:Speech Recognition, Tibetan, Deep Learning, End-to-End, Mel-Frequency Cepstral Coefficient, Spectrogram