Facial expression recognition (FER) is one of the most important research branches of affective computing. Owing to its wide applications in human-computer interaction, intelligent robotics, and digital entertainment, the topic has received increasing attention and has gradually become a hot research area. Early research on FER mainly focused on single-label facial expression data collected in laboratory environments with a frontal view, uniform illumination, and no occlusion. With the development of multimedia and artificial intelligence technologies, the focus of FER research has gradually shifted to real-world facial data. Such data vary widely in shooting scene and angle, lighting condition, occlusion, and labeling, all of which introduce emotion-independent noise. Although research on expression recognition has made considerable progress, existing methods still fall short given the complexity of facial images in real scenes. For single-scene data, robustness is weak, owing to (i) the difficulty of exploiting the complementarity between local and global features and (ii) the limited discriminativeness of the learned features. For cross-scene data, generalization is limited, owing to (iii) the distribution shift between source- and target-scene data and (iv) the absence of labels for target-scene data. To address these challenges, this dissertation proposes several novel FER methods based on deep convolutional neural networks, which mine affective semantic information under the guidance of emotion labels and thereby improve the robustness and generalization ability of the models. The main contributions of this dissertation are summarized as follows:

(1) To address the difficulty of effectively exploiting the complementarity between local and global features, a single-scene FER method based on multi-level key semantic analysis is proposed. A simple and effective local feature learning algorithm is designed, and multi-level attention mechanisms are developed to learn the key semantic information for effective FER. Specifically, a local patch generation module slides a window over the feature maps, which avoids the limitations of facial landmark detection and, through overlapping windows, preserves the information at patch edges (see the patch-extraction sketch below). Moreover, local and global multi-level attention sub-networks are constructed for the adaptive fusion of local detailed texture features and global macro profile features, so that the model can focus on the emotionally semantic components. Furthermore, extensive experimental results on unconstrained datasets show that the proposed method effectively improves the accuracy and robustness of FER.

(2) To address the weak discriminativeness of features, a single-scene FER method based on the collaboration of multi-branch discriminative semantics is proposed. The method improves the diversity of representations through a multi-branch network and strengthens their separability through discriminative feature learning, thereby effectively improving the model's ability to recognize facial expressions. Specifically, a multi-branch feature collaborative learning module is designed to obtain diverse semantic features, and a contribution-score learning component adaptively fuses these features to mine rich semantic information (see the fusion sketch below). Moreover, a discriminative semantic learning module enhances the separability of representations by discovering the semantic relationship between features and emotion labels, which reduces the difficulty of classification. Furthermore, extensive experimental results on large-scale real-scene unconstrained datasets show that the proposed method effectively improves accuracy and robustness on both basic and compound facial expressions.

(3) To address the distribution shift between cross-scene facial datasets, a cross-scene FER method based on consistent representations and semantic alignment is proposed. The method models the relationships between cross-domain samples at the scene level and the category level, respectively, so as to learn cross-domain consistent representations and align cross-domain samples interactively, thereby alleviating negative transfer. Specifically, a mutual information minimization module is designed to simultaneously distill domain-invariant knowledge and eliminate domain-sensitive information, which reduces the domain discrepancy in the marginal distribution and realizes scene-level domain adaptation. Moreover, a semantic metric learning module is developed based on the semantic relationship between representations and emotion labels to facilitate cross-domain information interaction, which bridges the domain gap in the conditional distribution and realizes category-level domain adaptation (see the alignment sketch below). Furthermore, extensive experimental results on multiple datasets show that the proposed method effectively reduces the domain gap and improves the model's generalization on target-scene data.

(4) To address the difficulty of extracting semantic information from unlabeled target-scene data, a cross-scene FER method based on category-aware joint self-training is proposed. The method first obtains high-quality initial pseudo labels for the target scene and then adaptively selects reliable ones for self-training, so as to effectively extract target-scene semantic knowledge. Specifically, a contrastive warm-up strategy is designed to enhance the semantic discriminativeness and cross-domain consistency of representations at both the instance and category levels, which benefits the generation of target pseudo labels. Moreover, a category-aware self-training module uses the average predictive entropy to model the recognition complexity of each category and adaptively selects relatively trustworthy pseudo labels from the easy-to-recognize categories for source-target co-training, fully mining the latent semantic information of the target scene (see the selection sketch below). Furthermore, extensive experimental results show that the proposed method effectively improves the accuracy and generalization ability of the model.
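For concreteness, the overlapping sliding-window patch generation in contribution (1) can be pictured as follows. This is a minimal PyTorch sketch, assuming the window slides over a (B, C, H, W) feature map; the patch size, stride, and dimensions are illustrative assumptions, not the dissertation's actual configuration.

```python
import torch

def extract_overlapping_patches(feat, patch=3, stride=2):
    """Slide an overlapping window over CNN feature maps to form local patches.

    feat: (B, C, H, W) feature maps.
    Returns: (B, N, C, patch, patch), where N is the number of windows.
    Using stride < patch makes windows overlap, which is what preserves
    the information at patch edges.
    """
    # unfold along H, then along W: (B, C, nH, nW, patch, patch)
    windows = feat.unfold(2, patch, stride).unfold(3, patch, stride)
    B, C, nH, nW, ph, pw = windows.shape
    # flatten the window grid into one patch axis
    return windows.permute(0, 2, 3, 1, 4, 5).reshape(B, nH * nW, C, ph, pw)

if __name__ == "__main__":
    fmap = torch.randn(2, 512, 7, 7)             # e.g. a late ResNet stage
    patches = extract_overlapping_patches(fmap)  # (2, 9, 512, 3, 3)
    print(patches.shape)
```

Because the window runs on feature maps rather than on the input image, no facial landmark detector is needed, which is the shortcoming the module is said to avoid.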
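The contribution-score fusion in contribution (2) admits a similarly compact sketch: each branch feature receives a learned scalar score, and softmax-normalized scores weight the fusion. The scoring head, branch count, and feature dimension below are hypothetical choices for illustration, not the dissertation's design.

```python
import torch
import torch.nn as nn

class ContributionScoreFusion(nn.Module):
    """Adaptively fuse features from multiple branches.

    Each branch feature is mapped to a scalar contribution score; the
    scores are softmax-normalized across branches and used as fusion
    weights (one plausible reading of the abstract, not its exact form).
    """
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4),
                                   nn.ReLU(inplace=True),
                                   nn.Linear(dim // 4, 1))

    def forward(self, branch_feats):
        # branch_feats: (B, K, D) — K branch feature vectors per sample
        scores = self.score(branch_feats)           # (B, K, 1)
        weights = torch.softmax(scores, dim=1)      # normalize over branches
        return (weights * branch_feats).sum(dim=1)  # (B, D) fused feature

if __name__ == "__main__":
    feats = torch.randn(8, 3, 512)                 # 3 branches, 512-d features
    fused = ContributionScoreFusion(512)(feats)
    print(fused.shape)                             # torch.Size([8, 512])
```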
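The abstract does not specify the form of the semantic metric in contribution (3). The sketch below shows one generic category-level alignment in that spirit: target features are pulled toward source class centers of the same (pseudo) label, which narrows the conditional-distribution gap. It is an assumed stand-in, not the dissertation's formulation.

```python
import torch
import torch.nn.functional as F

def class_center_alignment(src_feat, src_y, tgt_feat, tgt_y, num_classes):
    """Generic category-level semantic alignment (assumed form).

    Computes per-class source feature centers and penalizes the distance
    of same-class target features (labels or pseudo labels) to those
    centers, averaging over the classes present in the batch.
    """
    loss, used = src_feat.new_zeros(()), 0
    for c in range(num_classes):
        s, t = src_feat[src_y == c], tgt_feat[tgt_y == c]
        if len(s) == 0 or len(t) == 0:  # class absent on either side
            continue
        center = s.mean(dim=0)
        loss = loss + F.mse_loss(t, center.expand_as(t))
        used += 1
    return loss / max(used, 1)
```

In practice such a term would be weighted against the source classification loss, with tgt_y supplied by pseudo labels when the target scene is unlabeled.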
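Finally, the category-aware selection in contribution (4) can be read as ranking categories by their average predictive entropy and keeping more pseudo labels from easier (lower-entropy) categories. The linear keep-ratio rule below is an assumption for illustration; only the entropy-based, per-category idea comes from the abstract.

```python
import math
import torch

def select_pseudo_labels(probs, keep_base=0.5):
    """Category-aware pseudo-label selection by average predictive entropy.

    probs: (N, C) softmax outputs on unlabeled target-scene samples.
    Categories with lower mean entropy are treated as easier to recognize,
    so a larger share of their most confident pseudo labels is kept.
    """
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # (N,)
    conf, pseudo = probs.max(dim=1)                               # (N,), (N,)
    selected = torch.zeros(len(probs), dtype=torch.bool)
    max_ent = math.log(probs.size(1))                             # entropy upper bound
    for c in pseudo.unique():
        idx = (pseudo == c).nonzero(as_tuple=True)[0]
        difficulty = entropy[idx].mean().item() / max_ent         # in [0, 1]
        keep = int(len(idx) * keep_base * (1.0 - difficulty))     # easier -> keep more
        if keep > 0:
            top = conf[idx].topk(keep).indices                    # most confident first
            selected[idx[top]] = True
    return selected, pseudo
```

The selected subset would then join the labeled source data for the source-target co-training step, with the remainder left for later self-training rounds.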