
End-to-End Pronunciation Error Detection in Spoken English Based on Multimodality

Posted on: 2022-10-22    Degree: Master    Type: Thesis
Country: China    Candidate: Z Y Tan    Full Text: PDF
GTID: 2505306494479884    Subject: Electronics and Communications Engineering
Abstract/Summary:
In today's world, economic globalization is the trend of the times: exchanges between countries are growing ever closer, and more and more people are paying attention to learning spoken English. The development of computer-assisted language learning has made studying spoken English more convenient, but its core task, the detection and correction of mispronunciation, still has shortcomings. At present, pronunciation error detection and correction rely solely on analysis of the speech signal, and their accuracy needs improvement; in noisy environments in particular, accuracy drops significantly. Many English phonemes have distinct facial visual features; notably, almost all vowels can be distinguished by the degree of opening and rounding of the lips. In view of this, a multimodal end-to-end English pronunciation error detection and correction model based on the audio-video signal of pronunciation is proposed. It exploits rich audio and video features for pronunciation error detection, which improves detection accuracy to a large extent, especially in noisy environments.

To address the shortcomings of current lip feature extraction algorithms, which are overly complicated and insufficiently discriminative, a feature extraction scheme based on the opening and closing angle of the lips is proposed. Lip image frames are obtained by splitting the video into frames. After image denoising, a regression-tree algorithm based on gradient boosting is used to locate the key points of the lips. Scale normalization is then carried out to remove the influence of the speaker's tilt and movement. Finally, the opening and closing angle of the lips is computed geometrically, and the lip feature values are generated by combining the angle changes over time.

To give full play to the role of lip features in pronunciation error detection, a multimodal feature fusion model based on the lip angle features is proposed. The model interpolates the lip features constructed from the opening and closing angles, aligns and fuses the audio and video features along the time axis, and performs feature learning and classification through a bidirectional LSTM and a softmax layer. Finally, end-to-end pronunciation errors are measured with CTC. Experimental results on the GRID audio-visual corpus and a self-built multimodal test set show that the proposed model achieves a higher error detection rate than the traditional single-modality acoustic model, with the improvement most pronounced for vowels distinguishable by lip shape. Experiments on the corpus with added white noise verify that the proposed model is also more robust to noise than the traditional acoustic model.

To avoid the forced alignment of audio and video information, an end-to-end multimodal decision fusion model is also proposed. Two independent bidirectional LSTM-softmax branches learn the audio and video features separately, and the phoneme sequence with the highest recognition accuracy is obtained through decision fusion and a CTC layer. A weighted decision fusion strategy is compared with a Dempster-Shafer fusion strategy; experiments on the audio-visual corpus and the multimodal test set verify that models built with both strategies achieve excellent phoneme error detection rates and noise robustness. Finally, based on articulatory principles, corrective suggestions are provided for incorrect manner and place of articulation.
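As a rough illustration of the angle-based lip feature, the sketch below computes an opening angle at a mouth corner from 2D landmark coordinates and normalizes landmarks by mouth width. The specific key points and the normalization rule are assumptions for demonstration; the thesis's actual landmark set and scale normalization may differ.

```python
import math

def lip_opening_angle(left_corner, right_corner, upper_mid, lower_mid):
    """Opening angle (radians) at the left mouth corner, a simplified
    stand-in for the thesis's lip opening/closing angle feature.
    Points are (x, y) landmarks, e.g. from a gradient-boosted
    regression-tree landmark detector."""
    def angle_at(vertex, p1, p2):
        v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
        v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        n1, n2 = math.hypot(*v1), math.hypot(*v2)
        return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))
    return angle_at(left_corner, upper_mid, lower_mid)

def normalize_scale(points, left_corner, right_corner):
    """Scale landmarks by mouth width to remove the effect of the
    speaker's distance from the camera (illustrative normalization)."""
    width = math.hypot(right_corner[0] - left_corner[0],
                       right_corner[1] - left_corner[1])
    return [(x / width, y / width) for x, y in points]

# A nearly closed mouth yields a small angle, an open mouth a large one,
# which is what lets the feature separate open from close vowels.
closed = lip_opening_angle((0, 0), (4, 0), (2, 0.2), (2, -0.2))
opened = lip_opening_angle((0, 0), (4, 0), (2, 1.0), (2, -1.0))
```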
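The time alignment step (interpolating the lower-rate lip features to the audio frame rate before concatenation) can be sketched as a linear resampling. The frame rates mentioned in the comments are illustrative assumptions, not figures from the thesis.

```python
def interpolate_video_features(feats, n_audio_frames):
    """Linearly resample a per-video-frame scalar feature sequence (e.g.
    25 fps lip angles) to the audio frame count (e.g. 100 fps frames),
    so the two modalities can be fused frame by frame before the BiLSTM."""
    if len(feats) == 1:
        return [feats[0]] * n_audio_frames
    out = []
    for i in range(n_audio_frames):
        # fractional position of this audio frame inside the video sequence
        pos = i * (len(feats) - 1) / (n_audio_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, len(feats) - 1)
        frac = pos - lo
        out.append(feats[lo] * (1 - frac) + feats[hi] * frac)
    return out

# Two video-frame angles stretched across five audio frames
aligned = interpolate_video_features([0.0, 1.0], 5)
```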
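To make the CTC-based error measurement concrete: CTC decoding collapses a frame-level label path into a phoneme sequence by merging consecutive repeats and removing blanks, and the decoded sequence can then be compared with the canonical transcription. A minimal sketch follows; the phoneme labels and the naive position-wise comparison are illustrative only, not the thesis's exact procedure.

```python
def ctc_collapse(frame_labels, blank="-"):
    """Collapse a frame-level CTC path: merge consecutive repeats,
    then drop blank symbols."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

def detect_errors(decoded, canonical):
    """Flag position-wise phoneme substitutions (a real system would
    first align the sequences, e.g. with edit distance)."""
    return [(i, c, d)
            for i, (c, d) in enumerate(zip(canonical, decoded)) if c != d]

# Frame-level path from the BiLSTM-softmax output (illustrative labels)
path = ["-", "-", "b", "b", "-", "ih", "ih", "t", "-"]
decoded = ctc_collapse(path)                     # ["b", "ih", "t"]
# Learner said /b/ where /p/ was expected: one substitution is flagged
errors = detect_errors(decoded, ["p", "ih", "t"])
```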
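The weighted decision fusion strategy can be sketched as a convex combination of the per-phoneme posteriors produced by the two branches. The weight value and the tiny phoneme set below are illustrative assumptions; the thesis tunes the fusion weight and also evaluates a Dempster-Shafer rule, which combines evidence masses rather than averaging probabilities.

```python
def weighted_fusion(audio_probs, video_probs, w_audio=0.7):
    """Combine per-phoneme posteriors from the audio and video branches
    with a fixed weight, then renormalize to a distribution."""
    fused = {ph: w_audio * audio_probs[ph] + (1 - w_audio) * video_probs[ph]
             for ph in audio_probs}
    total = sum(fused.values())
    return {ph: p / total for ph, p in fused.items()}

# Under noise the audio branch confuses the vowels /ae/ and /eh/, but the
# lip-angle branch separates them by mouth opening, so fusion recovers
# the correct phoneme.
audio = {"ae": 0.45, "eh": 0.55}   # noisy audio slightly favors /eh/
video = {"ae": 0.80, "eh": 0.20}   # wide lip opening clearly indicates /ae/
fused = weighted_fusion(audio, video, w_audio=0.6)
best = max(fused, key=fused.get)   # "ae"
```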
Keywords/Search Tags: pronunciation error detection, multimodal, end-to-end, feature fusion, speech recognition