Emotion recognition, a challenging cutting-edge research direction in artificial intelligence, refers to inferring the emotional state of a subject through the synergistic integration of information from multiple modalities. Based on the number of collected signal forms (modalities), emotion recognition can be divided into unimodal emotion recognition and multimodal emotion recognition (MER). Compared with unimodal emotion recognition, MER achieves better recognition performance and robustness by synergistically integrating information from multiple modalities, and has therefore become the mainstream research direction in the field. Today, MER is integrated into many aspects of our lives, playing an important role in scenarios such as medical diagnosis, intelligent driving, and opinion monitoring. However, although MER performs well in various industrial applications, two problems in this direction still urgently need to be solved: how to effectively handle the feature sequence of each modality, and how to effectively fuse these sequences. To address these two problems, this paper proposes two lightweight MER methods from different perspectives, based on three modalities (text, visual, and audio), as follows:

(1) We propose a cross-modal dynamic temporal convolutional network (CM-DTCN) to address two problems in utterance-level MER: how to explore the interactions between feature sequences of different modalities (inter-modal interactions) and how to effectively model the interactions within each modal feature sequence (intra-modal interactions). Inspired by the characteristics of dynamic convolutional networks and the temporal convolutional network (TCN), we propose the dynamic temporal convolutional network (DTCN). More specifically, the DTCN is devised to effectively model the temporal relationships of feature sequences while mitigating the TCN's limited ability to learn features within each time step. On the one hand, we make full use of the DTCN to model inter-modal interactions, which not only mitigates the effect of redundant information on these interactions but also fully learns their temporal relationships, yielding powerful and compact multimodal fusion representations. On the other hand, we propose a new query-vector generation method that uses the multimodal fusion representations generated by the DTCN as query vectors to enhance the feature sequence of each modality. This avoids the influence of redundant information on the intra-modal interactions, directs attention to the important information underlying these feature sequences, and helps the model capture inter-modal interactions more adequately.

(2) We propose a multimodal semantic capsule integration network with channel expansion and fusion (MSeCIN-CEF) for extracting the high-level semantic information of each modal feature sequence and fusing the feature sequences of different modalities. Specifically, MSeCIN-CEF is mainly composed of two sub-networks: the multi-channel deformable convolutional network (MCDeCN) and the multimodal semantic capsule fusion network (MSeCIN). First, MCDeCN borrows the idea of feature channels from computer vision: its "channel expansion" operation uses convolutional kernels of different sizes to map each modal feature sequence to multiple feature channels, extracting semantic information at different granularities, while its "channel fusion" operation uses a deformable convolutional network to fuse all feature channels hierarchically, learning the implicit semantic information in each modal feature sequence by combining the semantic information of the other modalities. Second, MSeCIN first combines the features of all modalities at each word to generate local information capsules; these local information capsules are then processed by an improved dynamic routing algorithm and integrated into global information capsules, which are finally used for emotion prediction.

Extensive experiments and discussions were conducted on three multimodal emotion datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP. Both MER methods proposed in this paper achieve performance comparable to many existing state-of-the-art models. CM-DTCN and MSeCIN-CEF achieve 79.1% and 81.1% accuracy for two-class recognition on CMU-MOSI, 79.9% and 79.7% accuracy for two-class recognition on CMU-MOSEI, and 83.2% and 82.1% accuracy for four-class recognition on IEMOCAP, respectively, all on par with current advanced algorithms. The average number of parameters of CM-DTCN and MSeCIN-CEF across the three datasets is only 1,386,157 and 826,129, respectively, so both models remain low-complexity and lightweight while delivering excellent performance.
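To make the TCN building block concrete, the following is a minimal sketch of the dilated causal 1-D convolution at the heart of any temporal convolutional network. All names, kernel values, and sizes here are illustrative only; this is the generic TCN primitive, not the dynamic variant (DTCN) proposed above.

```python
# Minimal sketch of a dilated causal 1-D convolution, the core TCN
# operation. Everything here is illustrative, not taken from CM-DTCN.

def dilated_causal_conv1d(seq, kernel, dilation):
    """Causal 1-D convolution with dilation over a scalar sequence.

    The output at time t depends only on inputs at t, t-d, t-2d, ...,
    so no future information leaks into the present step.
    """
    k = len(kernel)
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for i in range(k):
            j = t - i * dilation  # look back i * dilation steps
            if j >= 0:
                acc += kernel[i] * seq[j]
        out.append(acc)
    return out

# Stacking such layers with dilations 1, 2, 4, ... grows the receptive
# field exponentially while keeping the parameter count small.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = dilated_causal_conv1d(x, kernel=[0.5, 0.5], dilation=2)
# → [0.5, 1.0, 2.0, 3.0, 4.0]
```

The exponentially growing receptive field is what lets a TCN model long-range temporal relationships with few layers; the "dynamic" extension discussed above additionally adapts the kernels per input, which this sketch does not show.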
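For the capsule side, the sketch below shows the classic routing-by-agreement procedure (Sabour et al.) that MSeCIN's routing step is built on. It is a generic reference implementation, not the improved variant used in MSeCIN-CEF, and all vectors and iteration counts are illustrative.

```python
import math

def squash(v):
    """Capsule squashing non-linearity: keeps direction, maps the norm into [0, 1)."""
    n2 = sum(x * x for x in v)
    n = math.sqrt(n2)
    scale = n2 / (1.0 + n2) / n if n > 0 else 0.0
    return [scale * x for x in v]

def dynamic_routing(predictions, iterations=3):
    """Routing-by-agreement over prediction vectors (one per lower-level capsule).

    `predictions` is a list of equal-length vectors; returns the routed
    output capsule. Predictions that agree with the current output get
    larger coupling coefficients on each iteration.
    """
    logits = [0.0] * len(predictions)              # routing logits b_i
    v = None
    for _ in range(iterations):
        m = max(logits)
        exp = [math.exp(b - m) for b in logits]
        z = sum(exp)
        c = [e / z for e in exp]                   # coupling coefficients (softmax)
        s = [sum(c[i] * predictions[i][d] for i in range(len(predictions)))
             for d in range(len(predictions[0]))]  # weighted sum of predictions
        v = squash(s)
        # agreement update: predictions aligned with v get larger logits
        logits = [logits[i] + sum(predictions[i][d] * v[d] for d in range(len(v)))
                  for i in range(len(predictions))]
    return v

# Two capsules agree on direction [1, 0]; one dissents with [0, 1].
caps = dynamic_routing([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In the MSeCIN setting, the lower-level capsules would be the per-word local information capsules and the routed output a global information capsule; the squashed norm can then serve as a confidence-like magnitude for emotion prediction.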