Deep Neural Network For Automatic Mispronunciation Detection And Error Diagnosis

Posted on: 2017-04-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W P Hu
Full Text: PDF
GTID: 1108330485451546
Subject: Control Science and Engineering
Abstract/Summary:
Rapid globalization across regions with different languages demands stronger foreign language proficiency among people who need to interact across language barriers. For non-native language learners, traditional one-to-one teacher-student interaction is the most effective form of instruction, but it is too expensive for many learners. Computer Assisted Language Learning (CALL) systems, powered by advances in speech technology, can bridge the gap between the supply of teachers and the demand from learners. They have become a ubiquitous learning tool on smartphones, tablets, laptops and similar devices. However, phone-level mispronunciation detection, an indispensable component of a CALL system that aims to detect or identify pronunciation errors or deficiencies at the phone level with high precision, is still challenging.

In recent years, deep learning, a new machine learning approach, has achieved major breakthroughs in many areas, particularly in speech recognition. It has reduced the Word Error Rate (WER) by a large margin on many benchmark databases. In this thesis, we explore Deep Neural Networks (DNN) for mispronunciation detection in depth. We first revise the standard acoustic model for pronunciation learning applications, and then improve the performance of mispronunciation detection by applying DNN from three different perspectives: posterior probability, hypothesis testing and 2-class pattern classification.

This thesis first improves the acoustic model's discrimination of Mandarin tone and lexical stress by embedding the F0 contour. Unlike spectral features, the F0 contour is only quasi-continuous and can disappear in the unvoiced regions. In traditional GMM-HMM systems, a heuristic approach is to interpolate F0 in the unvoiced regions. In this thesis, we feed the F0 contour to the DNN either with interpolation in the unvoiced regions or without it.
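As a rough illustration of the interpolation option described above, the following Python sketch linearly interpolates F0 across unvoiced frames. It is not the thesis's implementation: the function name `interpolate_f0`, the convention that unvoiced frames carry the value 0.0, and the choice of linear interpolation with edge filling are all assumptions made for this example.

```python
def interpolate_f0(f0, unvoiced=0.0):
    """Linearly interpolate an F0 contour across unvoiced frames.

    Assumption: unvoiced frames are marked with the value `unvoiced`.
    Leading and trailing unvoiced frames take the nearest voiced value.
    """
    voiced = [i for i, value in enumerate(f0) if value != unvoiced]
    if not voiced:
        return list(f0)  # no voiced frame to anchor the interpolation

    out = list(f0)
    first, last = voiced[0], voiced[-1]

    # Fill the edges with the nearest voiced value.
    for i in range(first):
        out[i] = f0[first]
    for i in range(last + 1, len(f0)):
        out[i] = f0[last]

    # Fill each interior gap on a straight line between its voiced neighbors.
    for left, right in zip(voiced, voiced[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            out[i] = f0[left] + weight * (f0[right] - f0[left])
    return out


# Made-up contour in Hz; 0.0 marks unvoiced frames.
contour = [0.0, 120.0, 0.0, 0.0, 150.0, 0.0]
print(interpolate_f0(contour))
```

In this sketch, the variant without interpolation corresponds to leaving the contour unchanged.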
The experimental results show that the DNN framework, with or without interpolation in the unvoiced regions, yields similar performance in both tone and base syllable recognition.

This thesis also studies the traditional Goodness of Pronunciation (GOP) approach to mispronunciation detection, expressed in terms of senone probabilities, and extends it to DNN-HMM systems. Based on the observation that a mispronunciation in the current phone negatively influences its neighboring phones when GOP is calculated, we further propose a more advanced GOP computation. The effectiveness of the revised GOP is verified on both phone-level mispronunciation detection and the corresponding phone diagnosis tasks on a large-scale Mandarin learning database.

This thesis presents a new hypothesis-testing-based approach to mispronunciation detection in the phonetic space. The phonetic space is composed of ASR senones, i.e., tied HMM states, whose posteriors are learned from acoustic features by a DNN. Because DNN training equalizes the variabilities of speakers, transducers and environments, the derived phonetic space is more phonetically sensitive than the acoustic space and hence more suitable for mispronunciation detection. Unlike traditional hypothesis testing approaches, we build the correct and incorrect pronunciation models at the senone level instead of the phone level, which enables a good description of the pronunciation space of each erroneous phone. In addition, to further improve discrimination in the phonetic space, this thesis proposes a new hidden state tying approach that uses phonetic features and a KL-divergence-based distance measure to reconstruct the decision trees, senones and acoustic models.
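A KL-divergence-based distance of the kind mentioned above can be sketched as follows. The abstract does not give the exact measure or what it is computed over, so this example assumes that each HMM state is summarized by a discrete probability distribution and uses the symmetric form KL(p||q) + KL(q||p). The function names, state names and values are hypothetical.

```python
import math


def symmetric_kl(p, q, eps=1e-10):
    """Symmetric KL divergence, KL(p||q) + KL(q||p), for discrete distributions.

    The two directions combine into sum_i (p_i - q_i) * log(p_i / q_i).
    Entries are floored at `eps` so that zeros do not break the logarithm.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i = max(p_i, eps)
        q_i = max(q_i, eps)
        total += (p_i - q_i) * math.log(p_i / q_i)
    return total


def closest_pair(distributions):
    """Return (distance, name_a, name_b) for the two most similar states."""
    names = sorted(distributions)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = symmetric_kl(distributions[a], distributions[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    return best


# Made-up distributions standing in for three HMM states.
states = {
    "state_A": [0.7, 0.2, 0.1],
    "state_B": [0.6, 0.3, 0.1],
    "state_C": [0.1, 0.2, 0.7],
}
print(closest_pair(states))
```

A tying procedure could treat the closest pair as the first candidates to share parameters; how the thesis combines this distance with phonetic features in the decision trees is not stated in the abstract.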
Experimental results show that the newly obtained models further improve the performance of mispronunciation detection.

Lastly, this thesis presents a new neural-network-based multi-task learning framework for mispronunciation detection. It addresses the problem that samples are distributed disproportionately across phones, which makes it hard to build a robust classifier for each phone independently. Multiple phone-specific 2-class logistic regression classifiers are built on top of a shared, common neural network. Training all classifiers in one joint neural network streamlines the process compared with training multiple individual classifiers separately. In addition, the shared hidden layer structure not only helps to extract more predictive features for distinguishing correct from incorrect phones, but also improves generalizability for phones that have too few samples to build a good independent classifier. These properties are verified on two English learning and Mandarin language learning databases.
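The shared-layer structure described above can be illustrated with a minimal sketch: one hidden layer shared by all phones, with a separate 2-class logistic output per phone. This is not the thesis's network. The class name `SharedMultiPhoneClassifier`, the single sigmoid hidden layer, the layer sizes, the input features and the plain SGD update are all assumptions chosen to keep the example small.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class SharedMultiPhoneClassifier:
    """One shared hidden layer feeding one logistic output per phone."""

    def __init__(self, n_inputs, n_hidden, phones, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                         for _ in range(n_hidden)]
        self.b_hidden = [0.0] * n_hidden
        # Each phone owns only a small weight vector and a bias.
        self.heads = {p: {"w": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
                          "b": 0.0} for p in phones}

    def hidden(self, x):
        """Shared features, computed the same way for every phone."""
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.w_hidden, self.b_hidden)]

    def predict(self, phone, x):
        """Estimated probability that `phone` was pronounced correctly."""
        head = self.heads[phone]
        h = self.hidden(x)
        return sigmoid(sum(w * v for w, v in zip(head["w"], h)) + head["b"])

    def train_step(self, phone, x, label, lr=0.5):
        """One SGD step on the cross-entropy loss of a single labeled sample."""
        head = self.heads[phone]
        h = self.hidden(x)
        y_hat = sigmoid(sum(w * v for w, v in zip(head["w"], h)) + head["b"])
        err = y_hat - label  # loss gradient w.r.t. the output pre-activation
        # Backpropagate with the head weights as they were before the update.
        grad_h = [err * w * hj * (1.0 - hj) for w, hj in zip(head["w"], h)]
        # Only this phone's head is updated ...
        head["w"] = [w - lr * err * hj for w, hj in zip(head["w"], h)]
        head["b"] -= lr * err
        # ... while the shared layer is updated by samples of every phone.
        for j, g in enumerate(grad_h):
            self.w_hidden[j] = [w - lr * g * v
                                for w, v in zip(self.w_hidden[j], x)]
            self.b_hidden[j] -= lr * g


model = SharedMultiPhoneClassifier(n_inputs=3, n_hidden=4, phones=["aa", "zh"])
features = [0.2, -0.1, 0.7]  # made-up features for one phone segment
before = model.predict("aa", features)
for _ in range(20):
    model.train_step("aa", features, label=1)
print(before, model.predict("aa", features))
```

Because every sample updates the shared layer regardless of its phone, a phone with few samples can still benefit from features learned on better-represented phones, which is the generalization effect the abstract describes.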
Keywords/Search Tags: Mispronunciation Detection, Computer Assisted Language Learning (CALL), Deep Neural Network (DNN), Hidden Markov Model (HMM), Automatic Speech Recognition (ASR)