Research And Application On Speech Recognition For Complex Scenes

Posted on: 2023-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: B W Tang
Full Text: PDF
GTID: 2568307043988519
Subject: Computer Science and Technology
Abstract/Summary:
Driven by deep learning, automatic speech recognition (ASR), an important field of artificial intelligence and pattern recognition, has developed rapidly. With the spread of intelligent devices, speech recognition technology is now widely used in daily life and work. However, in complex, realistic scenes such as specialized domains and noisy environments, speech recognition faces greater challenges. This thesis studies speech recognition in two kinds of complex scenes.

First, recognizing out-of-vocabulary (OOV) words in a specific domain is an open problem for end-to-end ASR. Here, OOVs are words that occur frequently only in a specific domain. An acoustic model trained on a conventional speech dataset recognizes OOVs poorly, and collecting speech datasets that contain OOVs is very expensive, so guiding an ASR system to recognize OOVs is very challenging. Second, a noisy background is another complex scene that speech recognition often faces. Audio-video speech recognition (AVSR), in which the visual modality assists speech recognition, is an effective way to counter the performance drop of ASR systems in noisy scenes: the visual modality is unaffected by acoustic noise, while the audio modality is recognized more accurately against a clean background. The key to AVSR is a reasonable method for fusing the two modalities. Research on speech recognition in complex scenes can promote the application of the technology in more realistic settings.

This thesis focuses on end-to-end ASR in complex scenes. Its main content is as follows:

(1) To address the lack of OOV speech datasets and the inability of current speech recognition systems to recognize OOVs, this thesis proposes a new speech recognition system that introduces an Out-of-vocabulary Spelling Correction (OOV-SC) model to recognize OOVs in a specific domain. The OOV-SC model is built on the Transformer network and post-processes the output of the acoustic model. An Alignment Loss is designed to improve the correction of the OOV parts of the acoustic model's recognition results and to reduce erroneous corrections of the non-OOV parts. In addition, the training method of the correction model is optimized: the OOV-SC model is trained on both synthetic and real speech recognition results in order to avoid acoustic model deviation. The experimental results show that the speech recognition system with the OOV-SC model improves the recognition of audio containing OOVs without affecting the recognition of conventional speech.

(2) Existing AVSR methods either ignore the influence of the audio modality when extracting video features or ignore the complementarity between modalities in the Transformer encoder. To address this, the thesis presents a novel Transformer-based audio-video recognition method. During encoding, two branch encoders encode the audio and video features separately, and an information interaction block lets each branch obtain complementary information from the other modality layer by layer. Feature fusion is then carried out during decoding. A Cross-Reconstruction Loss is used to promote information interaction during encoding. The experimental results show that the proposed method improves both the performance of audio-video recognition and the robustness of the model.

In summary, this thesis studies speech recognition in complex scenes, explores effective methods for the corresponding problems, and demonstrates their effectiveness and robustness through experiments, so that the speech recognition system can remain stable and usable in complex scenes.
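
The layer-by-layer information interaction between the audio and video branches described in (2) might look roughly like the following. This is only a minimal sketch, not the thesis's implementation: it assumes the interaction block is a plain scaled dot-product cross-attention with a residual connection, and it omits learned projections, self-attention, feed-forward layers, the decoder-side fusion, and the Cross-Reconstruction Loss. The names `cross_attend`, `interaction_block`, and `encode`, and the toy feature values, are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query frame attends over all context frames (scaled dot-product).

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out


def interaction_block(audio, video):
    """Hypothetical interaction step: each branch adds what it attends to in the other."""
    audio_ctx = cross_attend(audio, video)  # audio frames query the video branch
    video_ctx = cross_attend(video, audio)  # video frames query the audio branch
    new_audio = [[a + c for a, c in zip(fa, fc)] for fa, fc in zip(audio, audio_ctx)]
    new_video = [[v + c for v, c in zip(fv, fc)] for fv, fc in zip(video, video_ctx)]
    return new_audio, new_video


def encode(audio, video, num_layers=2):
    """Apply the interaction block once per encoder layer, keeping two separate branches."""
    for _ in range(num_layers):
        audio, video = interaction_block(audio, video)
    return audio, video


# Toy features: 3 audio frames and 2 video frames, 4 dimensions each (made-up values).
audio_feats = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.5, 0.1, 0.2], [0.3, 0.3, 0.3, 0.3]]
video_feats = [[0.4, 0.1, 0.0, 0.2], [0.2, 0.2, 0.6, 0.1]]
audio_out, video_out = encode(audio_feats, video_feats)
print(len(audio_out), len(video_out))
```

Each branch keeps its own sequence length, so the two modalities stay separate throughout encoding and can still be fused later, which is consistent with the abstract's statement that feature fusion happens in the decoding process.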
Keywords/Search Tags: automatic speech recognition, out-of-vocabulary, audio-visual recognition, multimodal fusion, Transformer