
Amdo Tibetan Speech Recognition Based On Deep Neural Network

Posted on: 2020-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: C C M Geng
Full Text: PDF
GTID: 2438330578464436
Subject: Computer Science and Technology
Abstract/Summary:
Speech recognition is an important branch of pattern recognition that aims to convert human speech into text. In Chinese and English speech recognition, deep neural networks have delivered a qualitative leap in recognition performance over the traditional Gaussian mixture model-hidden Markov model (GMM-HMM). To date, however, there have been few studies on Tibetan speech recognition. Tibetan is a low-resource language, and its phonetic characteristics (voiced consonants tend to be devoiced, vowels contrast in length, and the number of monophthongs increases) mean that Tibetan speech recognition still faces many challenges. Of the three Tibetan dialects, Lhasa, Kangba and Amdo, the Lhasa dialect has received relatively more attention in speech recognition research, while the Amdo and Kangba dialects have received relatively little. In particular, the application of deep neural networks to Amdo Tibetan speech recognition has not been studied in depth. This thesis therefore investigates the application of an end-to-end bidirectional long short-term memory (BiLSTM) network to Amdo Tibetan speech recognition, based on the structure of the Amdo Tibetan acoustic model. The research contents are as follows:

1) Corpus construction. An Amdo Tibetan corpus was established. The 1,278 monosyllabic words with the highest frequency in Tibetan were collected, and speech samples in the Amdo dialect were recorded for each word. Recording was done with Cool Edit Pro at a sampling frequency of 16 kHz and 16-bit quantization, in a room with noise no higher than 50 dB.

2) Preprocessing. Pre-emphasis, framing and windowing are applied to the Amdo Tibetan speech signal to reduce the effects of aliasing, high-order harmonic distortion, high-frequency effects and other factors introduced by the human vocal organs and the signal acquisition equipment. After preprocessing, the speech signal is more uniform and smooth, which yields better parameters in the feature extraction stage and thus improves recognition performance.

3) Feature extraction. For the Amdo Tibetan speech recognition task, the influence of different feature extraction methods on system performance is examined, taking the characteristics of Tibetan pronunciation into account. Features are extracted both with traditional Mel-frequency cepstral coefficients (MFCCs) and with a convolutional neural network (CNN). The experimental results show that CNN-based feature extraction performs better than MFCCs.

4) Acoustic modeling. Bidirectional long short-term memory networks are well suited to sequence problems, and connectionist temporal classification (CTC) removes the need to segment and align the training data frame by frame in advance or to post-process the network output separately. End-to-end acoustic modeling of Amdo Tibetan is therefore realized by combining CTC with a bidirectional long short-term memory network. Experimental results show that the end-to-end Amdo Tibetan acoustic model based on the bidirectional long short-term memory network achieves better performance.
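The preprocessing stage described above (pre-emphasis, framing, windowing) can be sketched in a few lines of plain Python. This is only an illustration, not the thesis's implementation: the 16 kHz sampling rate comes from the abstract, while the pre-emphasis coefficient (0.97), the 25 ms frame length, the 10 ms hop and the choice of a Hamming window are commonly used defaults assumed here. All function names are hypothetical.

```python
import math


def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is an assumed, commonly used value; the first sample is kept as-is.
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def frame_signal(signal, frame_len, hop_len):
    """Split the signal into overlapping frames, dropping an incomplete last frame."""
    if len(signal) < frame_len:
        return []
    count = 1 + (len(signal) - frame_len) // hop_len
    return [signal[i * hop_len: i * hop_len + frame_len] for i in range(count)]


def hamming_window(length):
    """Symmetric Hamming window coefficients."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def preprocess(signal, sample_rate=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasize, frame and window a mono signal given as a list of floats."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000      # 160 samples at 16 kHz
    window = hamming_window(frame_len)
    emphasized = pre_emphasize(signal, alpha)
    return [
        [sample * weight for sample, weight in zip(frame, window)]
        for frame in frame_signal(emphasized, frame_len, hop_len)
    ]
```

Under these assumed settings, one second of audio (16,000 samples) gives 98 windowed frames of 400 samples each, which would then be passed to the feature extraction stage.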
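To illustrate why CTC does not need frame-level alignment, the sketch below shows greedy (best-path) CTC decoding: the network emits one score vector per frame over the label set plus a blank symbol, and the output sequence is obtained by merging consecutive repeats and then dropping blanks. This is a generic illustration of the technique, not the decoder used in the thesis; the function name, the blank index and the example syllable labels are assumptions.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Best-path CTC decoding.

    frame_scores: one list of per-label scores for each frame.
    labels: label strings indexed like the scores; labels[blank] is the blank.
    """
    # Pick the highest-scoring label index in every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = None
    for index in best_path:
        # Emit a label only when it differs from the previous frame and is not blank.
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return decoded


# Hypothetical label set: index 0 is the CTC blank, the rest are placeholder syllables.
LABELS = ["<blank>", "ka", "kha", "ga"]

# Seven frames whose best path is [1, 1, 0, 1, 3, 3, 0].
SCORES = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.8, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.2, 0.1, 0.1, 0.6],
    [0.8, 0.1, 0.0, 0.1],
]

print(ctc_greedy_decode(SCORES, LABELS))  # ['ka', 'ka', 'ga']
```

The blank frame between the two runs of "ka" is what lets the decoder output the same syllable twice in a row; without it, the repeated frames would collapse into a single label.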
Keywords/Search Tags: Deep neural network, Bidirectional long short-term memory network, Amdo Tibetan, Acoustic modeling, End-to-End, Connectionist temporal classification