
Research On Speech Recognition Technology For Online Education Application

Posted on: 2023-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: X K Huang
Full Text: PDF
GTID: 2557306827496214
Subject: Software engineering
Abstract/Summary:
In recent years, online education has been favored by more and more users because of its convenience, intelligence, and other advantages. By relying on intelligent detection tools to assist teaching, online education can provide "personalized" guidance that addresses the learning problems of different user groups. Intelligent speech technology plays an important role in this field: functions such as oral scoring, in-class speech transcription, and automatic subtitle generation for online video help teachers and students improve the efficiency of teaching and learning, so that they can focus on the knowledge itself. However, current end-to-end speech recognition technology still has the following shortcomings: (1) models recognize long utterances poorly; (2) in noisy and reverberant environments, the recognition rate drops severely; (3) background voices are wrongly recognized by the model as target speech. The main work and contributions of this paper are as follows.

First, an end-to-end speech recognition model based on the Conformer framework with joint CTC training is established, and the influence of progressive down-sampling and a multi-scale attention mechanism on long-speech recognition is studied. The multi-scale attention mechanism combines convolution and self-attention to learn speech representations at different scales, and it achieves better recognition on long speech. Experiments show that the multi-scale attention Conformer proposed in this paper effectively improves the model's generalization to long-speech recognition scenarios.

Second, for noisy environments, a dual-path TFCN speech enhancement model is proposed. Using a progressive learning strategy, the model separately models the magnitude spectrum and the real and imaginary components of the signal, and finally produces the denoised speech. The method not only uses the information in the magnitude spectrum but also learns the phase spectrum through the real and imaginary components, which yields a better denoising effect. In addition, it has far fewer parameters than WaveNet and U-Net models.

Third, for the case where the speech recognition model incorrectly recognizes non-target speech because of background voices, this paper proposes a dual-path TFCN target speaker extraction algorithm. It projects the speech to be recognized and the registered (enrollment) speech into the same feature space through a shared audio encoding network, and then obtains the speaker features from the registered speech through multi-task learning. These features are processed by a speaker attention mechanism and the TFCN network to eliminate interference from other background voices. Experiments show that the time-frequency-domain TFCN speaker extraction algorithm outperforms the mainstream SpEx model on distortion metrics such as SI-SDR and SDR.

In summary, for long speech, environmental noise, background voices, and other complex situations that a speech recognition system may encounter in online education scenarios, this paper proposes a feasible, lightweight, and robust combination of speech recognition models that can effectively address recognition in complex scenes.
Keywords/Search Tags: Deep learning, speech recognition, speech enhancement, multi-task learning, online education
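
The abstract reports results in terms of SI-SDR (scale-invariant signal-to-distortion ratio). For readers unfamiliar with the metric, the following is a minimal sketch of its commonly used definition, not the thesis's own evaluation code. The function name `si_sdr`, the mean-removal step, and the `eps` stabilizer are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Illustrative sketch of the common definition; `eps` is an assumed
    small constant that avoids division by zero and log of zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference: this gives the optimally
    # scaled copy of the reference contained in the estimate.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Everything left over after the projection counts as distortion.
    distortion = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)

    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]
    estimate = [1.1, -0.9, 0.9, -1.1]
    print(round(si_sdr(estimate, reference), 3))
    # Rescaling the estimate leaves the score essentially unchanged.
    print(round(si_sdr([2.0 * x for x in estimate], reference), 3))
```

Because the estimate is projected onto the reference before the ratio is taken, multiplying the estimate by any nonzero gain leaves the score essentially unchanged. This makes the metric suited to comparing enhancement or extraction systems whose output levels differ.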