Research On Technologies Of Audio-Visual Bimodal Speech Recognition Based On Attention Mechanism

Posted on:2022-10-27

Degree:Master

Type:Thesis

Country:China

Candidate:M J Liu

GTID:2518306311492544

Subject:Information and Communication Engineering

Abstract/Summary:

Speech recognition is an important technology in artificial intelligence and machine learning. It has been studied widely, and a growing number of speech products have been put into use, bringing convenience to daily life. However, in complex environments, such as multi-speaker backgrounds or mixed or missing voices, speech recognition technology still needs to be improved. Adding video information enables speech recognition to cope with more complex environments, and a growing number of researchers are working on Audio-Video Bimodal Speech Recognition (AVSR). Compared with a single modality, AVSR can compensate for the weaknesses of each modality. For example, the accuracy of Audio Speech Recognition (ASR) drops sharply under severe noise, and Video Speech Recognition (VSR) suffers from the ambiguity of homophones; both can be compensated for in the bimodal setting. But AVSR also poses greater challenges. On the one hand, acoustic speech has a well-established feature, the mel-frequency cepstral coefficients (MFCC), whereas video features are diverse, and it is difficult to determine which visual features improve ASR results and match well with MFCC. On the other hand, it is hard to effectively integrate two related data streams running at different frame rates. Because the accuracy of VSR is lower than that of ASR in most cases, inappropriate fusion may even degrade the original speech recognition result. To address these two problems, we study the key technologies in the AVSR model, including feature extraction, information fusion, and classification, and propose new methods for both. The main work includes the following aspects:

1. An improved image denoising algorithm using graph frequency filtering is proposed, which improves the quality of the video images and prepares for extracting distinctive video features. First, a weight matrix is constructed according to the correlation between image pixels. Then, combined with the Laplacian matrix, the graph Fourier transform is carried out and a frequency filtering formula is obtained. We then explore this formula, put forward a more general filtering formula, and optimize it to improve image quality. Compared with Gaussian, Wiener, and basic frequency filtering methods, this method achieves a better denoising effect. In addition, instead of a plain Convolutional Neural Network (CNN), we design a Residual Network (ResNet) architecture to extract video features, because ResNet can be made deeper and can extract more high-level features.

2. An AVSR model based on the attention mechanism is designed, which combines early and late fusion of features and improves the effectiveness of information fusion. In the encoding stage, the audio and video features are aligned and corrected through the attention mechanism to obtain corrected audio features, which realizes pre-fusion. In the decoding stage, two independent attention mechanisms are used, one for the video features and the other for the corrected audio features. The two resulting vectors are concatenated to decide the final recognition result, which realizes post-fusion. This use of attention effectively solves the information fusion problem caused by the different rates and lengths of audio and video. The video information assists recognition twice, which greatly improves the recognition results under noise and increases the robustness of the model to noise. Analysis of the experimental data shows that excessive participation of video information also degrades model performance under clean conditions. To decide whether pre-fusion or post-fusion of video information is needed, a model selection method based on Signal-to-Noise Ratio (SNR) estimation is proposed, which exploits the advantages of the different models and can handle recognition tasks in different environments.

3. The research is not limited to the proposed AVSR model, but also covers ASR, VSR, and a post-fusion model. We carried out experiments on the open GRID database and explored the influence of visual information on speech recognition under different noise conditions. Compared with ASR and VSR, our model obtains better results: it achieves better recognition under severe noise and also performs well when the speech signal is clean. Compared with other experiments on the GRID database, the proposed model also achieves a certain improvement. Moreover, the model is modular, which makes it convenient to apply and port.
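
The pre-fusion step described in the abstract, aligning audio and video streams of different lengths through attention, can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes plain scaled dot-product attention without learned projections, random stand-in features, illustrative sequence lengths (`T_audio`, `T_video`) and feature size (`d`), and concatenation as the way the "corrected" audio features are formed. All names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    Each query frame attends over all key frames, so the output has one
    row per query frame regardless of how many key frames there are.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # shape (T_q, T_k)
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ values, weights


# Illustrative sizes only (assumptions, not values from the thesis):
# the audio stream has more frames than the video stream.
rng = np.random.default_rng(0)
T_audio, T_video, d = 300, 75, 16
audio = rng.standard_normal((T_audio, d))  # stand-in for audio features
video = rng.standard_normal((T_video, d))  # stand-in for video features

# Audio frames query the video frames; the result is a video context
# vector aligned to every audio frame, which removes the rate mismatch.
video_context, weights = cross_attention(audio, video, video)

# Assumed fusion rule: concatenate each audio frame with its video context.
fused_audio = np.concatenate([audio, video_context], axis=-1)
```

Because the attention output always has one row per audio frame, no explicit resampling of the video stream is needed. A trained model would add learned query, key, and value projections and would typically replace the concatenation with a learned combination.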

Keywords/Search Tags:

Attention Mechanism, Audio-Visual Bimodal, Image Denoising, Feature Fusion, Speech Recognition

Related items

1	Bimodal Speech Recognition Technology Research Based On Audio And Video
2	A Study On Bimodal Audio Visual Speech Recognition Based On Deep Learning
3	Research On Audio-Visual Dual-Modal Speech Recognition Algorithm Based On Feature Fusion
4	Speech Endpoint Detection Based On Audio And Visual Features
5	Study On Cross-modal Speech Recognition Methods With Fusion Lipreading
6	Research On Children’s Emotion Recognition Based On The Fusion Of Speech And Text Bimodality
7	Audio-Visual Multi-Modal Fusion Approach Research And Application
8	Research On Speech Separation Algorithm Based On Deep Learning
9	Research On Expression And Speech Bimodal Emotion Recognition Of Children
10	Research On Speech Recognition Method Based On Feature Fusion Under Attention Mechanism