Automatic Speech Recognition (ASR) serves as a bridge for smoother communication between people, and between people and machines. With the continued adoption of deep learning in speech technology, the error rate of speech recognition systems has dropped significantly. However, real-world audio is usually collected in noisy environments, and recognition algorithms that learn from audio features alone suffer in accuracy. In this thesis, visual information is used as supplementary information, and an audio-visual bimodal feature fusion approach is analyzed and studied in depth to further improve recognition accuracy.

First, this thesis builds an audio-visual bimodal speech recognition algorithm with strong recognition performance (AV-Lip Net). In addition, to give the network the ability to select informative features, this thesis builds an audio-visual bimodal speech recognition algorithm based on the Convolutional Block Attention Module (AV-CBAM-Lip Net), which performs feature attention in the spatial and channel dimensions to improve recognition accuracy. Further, considering the similarity of features between adjacent frames in the input sequence, this thesis constructs the Audio-visual Time-Space-Channel Attention Module Lip Net (AV-TSCAM-Lip Net) on top of AV-CBAM-Lip Net, designing a temporal attention mechanism to extract salient information along the frame dimension. Experimental comparison verifies that the proposed AV-TSCAM-Lip Net achieves better recognition performance and faster convergence than other deep learning algorithms.

Although speech recognition supported by visual information is stable in noisy environments, it still faces challenges under complex conditions such as difficult lighting and low resolution. In this thesis, Feature
Pyramid Networks (FPN) are used to extract multi-scale features from the visual information, merging low-level detail with high-level semantic information so that the algorithm can attend to subtle feature changes. On this basis, to integrate visual and auditory features more effectively, a cross-modal AV-TSCAM-Lip Net algorithm based on feature fusion (Cross-modal AV-TSCAM-Lip Net) is proposed. It adopts a bidirectional fusion structure for the audio and image streams: the two modalities are cross-joined, allowing interaction between cross-modal intermediate representations so that the audio-visual bimodal features are combined organically and the impact of environmental factors is compensated. Compared with other algorithms, the proposed Cross-modal AV-TSCAM-Lip Net not only achieves the lowest recognition error rate and faster convergence, but also maintains good recognition performance under different noise intensities, verifying its recognition performance and strong noise robustness.
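The channel- and spatial-attention operations that AV-CBAM-Lip Net builds on can be illustrated with a minimal NumPy sketch of a generic CBAM-style module. This is not the thesis implementation: the MLP weights are random stand-ins for learned parameters, and the 7x7 convolution that CBAM applies to the pooled spatial maps is replaced here by a fixed equal-weight fusion for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, reduction=2):
    # feat: (C, H, W). A shared two-layer MLP scores global avg- and
    # max-pooled channel descriptors; their sum gates each channel.
    C = feat.shape[0]
    rng = np.random.default_rng(0)             # stand-in for learned weights
    w1 = rng.standard_normal((C // reduction, C)) * 0.1
    w2 = rng.standard_normal((C, C // reduction)) * 0.1
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    avg = feat.mean(axis=(1, 2))               # (C,)
    mx = feat.max(axis=(1, 2))                 # (C,)
    scale = sigmoid(mlp(avg) + mlp(mx))        # (C,) per-channel gate
    return feat * scale[:, None, None]

def spatial_attention(feat):
    # feat: (C, H, W). Pool across channels, fuse the avg/max maps,
    # and gate each spatial position. (CBAM fuses with a 7x7 conv;
    # an equal-weight sum stands in here.)
    avg = feat.mean(axis=0, keepdims=True)     # (1, H, W)
    mx = feat.max(axis=0, keepdims=True)       # (1, H, W)
    gate = sigmoid(0.5 * (avg + mx))           # (1, H, W) per-position gate
    return feat * gate

def cbam(feat):
    # Sequential channel-then-spatial refinement, as in CBAM.
    return spatial_attention(channel_attention(feat))

feat = np.random.default_rng(1).standard_normal((8, 4, 6))
out = cbam(feat)
print(out.shape)  # attention refines values but preserves the (C, H, W) shape
```

The temporal attention added in AV-TSCAM-Lip Net follows the same pattern along a third axis: per-frame descriptors are pooled and scored so that salient time frames are up-weighted before fusion.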