
End To End Mispronunciation Detection And Diagnosis

Posted on: 2021-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Feng
Full Text: PDF
GTID: 2518306569494804
Subject: Computer Science and Technology
Abstract/Summary:
With the growing demand for language learning, Computer-Aided Pronunciation Training (CAPT) systems are expected to deliver higher performance. Compared with traditional teachers, CAPT systems offer low cost and high flexibility, and they are favored by a growing number of second-language (L2) learners. One of the key technologies of a CAPT system is mispronunciation detection and diagnosis (MD&D). MD&D can be treated as a special type of automatic phone recognition: the recognized phones are compared with the canonical productions (obtained from the text prompts presented to the speakers). A mismatch flags a mispronunciation (detection), and the recognized phone indicates what the speaker actually produced (diagnosis).

For the task of mispronunciation detection, this thesis first explores the effect of traditional unsupervised mispronunciation detection methods. To verify the necessity of designing and training dedicated models for mispronunciation detection and diagnosis, it compares the results of different goodness of pronunciation (GOP) algorithms used in unsupervised mispronunciation detection.

For the task of mispronunciation detection and diagnosis, this thesis first constructs a data processing pipeline that includes audio feature extraction, phoneme information normalization, and a data augmentation strategy. It then designs several single-modal phone sequence labeling models with different structures. Comparative experiments verify the effectiveness of the data augmentation strategy and identify the best single-modal model structure. Because the text is known before mispronunciation detection and diagnosis begins, a multimodal phone sequence labeling model is also constructed. Through an attention mechanism, this model aligns the audio information at each position with the text information to achieve better phone classification at that position. Experimental results show that the proposed multimodal phone sequence labeling model improves significantly on all indicators compared with the single-modal model.

Because the dataset was recorded by speakers from different countries with different first languages, this thesis also explores how to improve the model by integrating first language (L1) information into the multimodal phone sequence labeling model. To this end, it investigates both multi-task models and multi-input models. The experimental results show that first language information improves the phone sequence labeling model to a certain extent, and that the multi-input model is the best way to integrate it. The best model designed in this thesis is the first mispronunciation detection and diagnosis model that integrates both text and first language information. Among all the proposed model variations and the existing models compared in our experiments, it achieves the best performance on the open dataset L2-ARCTIC.
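The idea of treating MD&D as phone recognition followed by a comparison with the canonical phones can be illustrated with a minimal sketch. It is not the thesis's model: it assumes a recognizer has already produced a phone sequence, and it uses a plain Levenshtein alignment to label each canonical phone as correct, substituted, or deleted, and each extra recognized phone as inserted. The function names `align_phones` and `diagnose`, the `"-"` gap symbol, and the ARPAbet-style phone labels are all assumptions made for illustration.

```python
from typing import List, Tuple

Alignment = List[Tuple[str, str, str]]  # (operation, canonical phone, recognized phone)


def align_phones(canonical: List[str], recognized: List[str]) -> Alignment:
    """Align two phone sequences with unit-cost Levenshtein dynamic programming."""
    n, m = len(canonical), len(recognized)
    # cost[i][j] = edit distance between canonical[:i] and recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(canonical[i - 1] != recognized[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # canonical phone was dropped
                cost[i][j - 1] + 1,             # extra phone was produced
            )

    # Trace back from the bottom-right corner to recover the operations.
    ops: Alignment = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(canonical[i - 1] != recognized[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                op = "substitution" if mismatch else "correct"
                ops.append((op, canonical[i - 1], recognized[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], "-"))
            i -= 1
        else:
            ops.append(("insertion", "-", recognized[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def diagnose(canonical: List[str], recognized: List[str]) -> Alignment:
    """Return only the mispronounced positions (detection) with their type (diagnosis)."""
    return [op for op in align_phones(canonical, recognized) if op[0] != "correct"]


# Hypothetical example: prompt "the cat", canonical phones DH AH K AE T.
canonical_phones = ["DH", "AH", "K", "AE", "T"]
recognized_phones = ["Z", "AH", "K", "AE"]  # assumed recognizer output
errors = diagnose(canonical_phones, recognized_phones)
print(errors)
```

For this made-up input, the sketch reports a substitution of DH by Z and a deletion of the final T. A trained phone sequence labeling model would replace the assumed recognizer output; the comparison step stays the same.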
Keywords/Search Tags: computer-aided pronunciation training system, mispronunciation detection and diagnosis, end-to-end model, multimodal fusion, multi-task learning