Emotion recognition typically draws on multiple information sources, such as physiological signals and behavioral features, to infer emotional categories. Multi-modal emotion recognition based on audio and video has received widespread attention owing to its robustness. However, most existing methods do not fully consider the temporal characteristics of each modality or the complementary nature of modality information, which makes it difficult to integrate features from different modalities efficiently. In addition, the diversity of subject identities introduces interference factors into model learning, making significant accuracy improvements difficult to achieve. To overcome these obstacles, this research makes two main contributions:

(1) This paper proposes a multi-modal emotion recognition method that combines attention mechanisms with a dual-sequence LSTM network. Several types of attention are added to better capture information relevant to audio-visual emotion recognition. First, for the video branch, an efficient ResNeXt50 network is combined with a coordinate attention mechanism to capture the positional information and long-range spatial dependencies of the video frame sequence. For the audio branch, a one-dimensional CNN with a self-attention mechanism learns semantic features. Second, the features of the two modalities are processed separately by an embedded dual-sequence LSTM network with self-attention, and the fused representation is used to generate the final emotion prediction. The self-attention mechanism and the dual-sequence LSTM network ensure the complementarity and completeness of the modality features, while pairing each feature extraction network with a suitable attention mechanism lets each branch express its most discriminative features. Comparative experiments on two datasets, RAVDESS and eNTERFACE’05, together with ablation experiments, verify that the proposed algorithm accurately processes temporal and complementary information and reduces redundant information in the fused features.

(2) This paper proposes a multi-task feature-space decoupling method for audio-visual emotion recognition, which reduces the influence of identity-related representations on emotion classification by decoupling them from the audio-visual features. First, emotion and identity encoders map the fused audio-visual features into separate task-specific latent spaces. Then, a multi-task training scheme learns emotion and identity latent representations, and an emotion-identity coupling loss function measures the classification loss of the emotion and identity recognition tasks. The weight of each task is updated dynamically in an adaptive manner to guide model parameter learning and improve classification accuracy. Experiments on the RAVDESS and eNTERFACE’05 datasets, along with ablation and feature-visualization experiments, verify that the multi-task feature-space decoupling method improves emotion recognition accuracy by weakening the coupling between emotional and identity-related features.
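The abstract does not give implementation details for the self-attention applied to the per-modality feature sequences before the dual-sequence LSTM. As a minimal sketch of standard scaled dot-product self-attention over a sequence of frame-level features (the dimensions and projection matrices are illustrative assumptions, not the author's exact configuration):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a feature sequence.

    x: (T, d) sequence of per-frame modality features.
    w_q, w_k, w_v: (d, d) query/key/value projection matrices.
    Returns the attended sequence (T, d) and the (T, T) attention map.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])       # pairwise frame affinities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # each row sums to 1
    return attn @ v, attn

# Illustrative sizes: 8 time steps of 16-dim audio (or video) features.
rng = np.random.default_rng(0)
T, d = 8, 16
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
```

Each modality's attended sequence would then be fed to its LSTM stream, so every output step is a weighted mixture of all time steps rather than a single frame.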
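The abstract states that the task weights in the emotion-identity coupling loss are updated adaptively but does not specify the rule. One common scheme, used here purely as an assumption, weights each task by its recent relative loss descent (softmax over loss ratios, in the style of dynamic weight averaging), so a task whose loss is falling more slowly receives more weight:

```python
import numpy as np

def adaptive_task_weights(loss_hist, temperature=2.0):
    """Dynamic weights for a two-task coupling loss (assumed scheme).

    loss_hist: dict mapping task name -> [loss at step t-2, loss at step t-1].
    A task whose loss ratio L(t-1)/L(t-2) is closer to 1 (slow descent)
    receives a larger weight; weights are normalized to sum to the task count.
    """
    ratios = {t: h[-1] / h[-2] for t, h in loss_hist.items()}
    exps = {t: np.exp(r / temperature) for t, r in ratios.items()}
    z = sum(exps.values())
    n = len(exps)
    return {t: n * e / z for t, e in exps.items()}

def coupling_loss(l_emotion, l_identity, weights):
    """Weighted sum of the emotion and identity classification losses."""
    return weights["emotion"] * l_emotion + weights["identity"] * l_identity

# Emotion loss fell faster (1.00 -> 0.90) than identity loss (0.80 -> 0.78),
# so the identity task should receive the larger weight this step.
hist = {"emotion": [1.00, 0.90], "identity": [0.80, 0.78]}
w = adaptive_task_weights(hist)
total = coupling_loss(0.90, 0.78, w)
```

In the described method these weights would rebalance the joint loss each epoch, steering the shared encoder away from over-fitting identity cues while the decoupled latent spaces keep the two tasks separated.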