| With the growing popularity of second language learning, and of English learning in particular, computer-assisted language learning systems have received increasing attention as a convenient way to learn, and the demands placed on them have risen accordingly. Spoken language learning is an important part of language learning, and mispronunciation detection and diagnosis, as an aid to spoken language learning, is an important component of a computer-assisted language learning system. Compared with one-to-one teaching by a speaking instructor, mispronunciation detection and diagnosis in such a system can be deployed on devices such as computers and mobile phones, which offers high flexibility and low cost. Mispronunciation detection and diagnosis is a special speech recognition task: it recognizes the phoneme sequence of the user’s pronunciation and compares it with the standard phoneme sequence to find the type and location of each mispronunciation. This paper first introduces the traditional unsupervised mispronunciation detection method based on statistical models. Because the traditional method can only detect mispronunciations but cannot diagnose the specific error, and because its implementation is complicated and its steps are cumbersome, it is necessary to design models specifically for the mispronunciation detection and diagnosis task. Labeled data for mispronunciation detection and diagnosis is scarce, so models are insufficiently trained. To alleviate this data scarcity, this paper proposes introducing a self-supervised pre-trained model into the task and exploiting its feature extraction ability. In addition, because the text read by the user is known in this task, a multi-feature mispronunciation detection and diagnosis model is constructed that combines text features and audio features through an attention mechanism, enhancing the features and improving model performance. Experiments on the L2-ARCTIC and TIMIT datasets validate the effectiveness of both introducing the self-supervised pre-trained model into the mispronunciation detection task and the multi-feature model, and comparison shows that the proposed model outperforms the baseline model. |
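
The comparison between the recognized phoneme sequence and the standard phoneme sequence can be illustrated with a plain edit-distance alignment. The following is a minimal sketch under stated assumptions, not the model proposed in this paper: the function name `diagnose`, the error labels, and the ARPAbet-style phoneme symbols are all hypothetical and chosen only for illustration.

```python
def diagnose(canonical, recognized):
    """Align two phoneme sequences and list the mispronunciations.

    Hypothetical helper for illustration only. Returns a list of tuples
    (position in canonical, error type, expected phoneme, spoken phoneme).
    """
    n, m = len(canonical), len(recognized)
    # dp[i][j] = minimum number of edits turning canonical[:i] into recognized[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost,  # match or substitution
                dp[i - 1][j] + 1,         # canonical phoneme was dropped
                dp[i][j - 1] + 1,         # extra phoneme was spoken
            )

    # Walk back through the table to recover the type and location of each error.
    errors = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = canonical[i - 1] != recognized[j - 1]
            if dp[i][j] == dp[i - 1][j - 1] + (1 if mismatch else 0):
                if mismatch:
                    errors.append(
                        (i - 1, "substitution", canonical[i - 1], recognized[j - 1])
                    )
                i -= 1
                j -= 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            errors.append((i - 1, "deletion", canonical[i - 1], None))
            i -= 1
        else:
            errors.append((i, "insertion", None, recognized[j - 1]))
            j -= 1
    errors.reverse()
    return errors


# Example: the learner says "D AH S" where "DH AH S" is expected.
print(diagnose(["DH", "AH", "S"], ["D", "AH", "S"]))
```

In this sketch the error type (substitution, deletion, insertion) plays the role of the diagnosis and the index plays the role of the location; a real system would obtain `recognized` from the acoustic model rather than from a hand-written list.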