Facial expression is the most efficient and natural non-verbal channel for conveying human emotion. With the advancement of technology, the demand for intelligent human-computer interaction systems is growing, because understanding human facial expressions helps such systems analyse a user's emotional state and respond appropriately. Automatic facial expression recognition has therefore become a popular research area in computer vision and affective computing, owing to its potential commercial value.

Research on expression recognition generally involves three tasks: discrete expression classification, facial Action Unit (AU) activation detection, and Valence-Arousal (VA) regression. However, most previous work has focused on a single task, and multi-task methods have been dominated by the traditional parallel-branch pattern; few studies jointly exploit the intrinsic connections between the three emotional tasks to design a more refined multi-task learning method. At the same time, facial expressions of the same type can still differ considerably due to identity attributes such as skin colour, age, gender and appearance, which undoubtedly makes expression recognition harder to learn.

Based on deep learning theory, this thesis proposes a multi-task expression recognition method built on a feature-disentanglement mechanism. First, an implicit feature-disentangled backbone network is built to counter the influence of identity attributes and other nuisance features. The backbone uses prior face knowledge from large-scale face datasets to model expression features as the subtraction of identity features from facial features, effectively reducing the influence of identity attributes and improving robustness. Compared with explicit feature-disentanglement methods, this approach requires less computation and is easier to train. Second, following the hierarchical progression between the expression recognition tasks, this thesis designs an AU → expression classification → VA streaming multi-task learning module to learn the progressive flow from local facial action units to the overall emotional state, which also facilitates the transfer of effective gradients. Third, to address class imbalance and overfitting, this thesis adopts a compound loss-function optimisation strategy and a feature-space expansion strategy: the former accounts for the model's optimisation objectives in multiple ways, while the latter includes semi-supervised learning on the test set, a "double labelling" method that resolves the semantic ambiguity of expressions, and task-specific data augmentation. In addition, to exploit the temporal information in sequences, this thesis introduces two schemes, a GRU module and a 3DCNN module, to automatically learn the similarities and differences between consecutive frames. Finally, the method achieves state-of-the-art results on all three tasks of the Aff-Wild2 dataset, with scores of 0.697, 0.777 and 0.490 on the AU, expression classification and VA tasks respectively. Comparison experiments and the analysis of ablation experiments validate the superiority of the method and the effectiveness of each module.
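The implicit disentanglement described above can be sketched in a few lines: expression features are modelled as the full facial features minus the identity features produced by a face-recognition prior. The sketch below uses NumPy with random linear maps standing in for the two encoders; all names and dimensions are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

# Toy sketch of implicit feature disentanglement by subtraction.
# W_face stands in for the trainable backbone; W_id stands in for a
# frozen identity encoder pretrained on a large-scale face dataset.
rng = np.random.default_rng(0)
D_IN, D_FEAT = 8, 4

W_face = rng.normal(size=(D_IN, D_FEAT))  # hypothetical backbone weights
W_id = rng.normal(size=(D_IN, D_FEAT))    # hypothetical identity-prior weights

def facial_features(x):
    return x @ W_face

def identity_features(x):
    return x @ W_id

x = rng.normal(size=(2, D_IN))   # a toy batch of two flattened face crops
f_face = facial_features(x)      # features entangling expression and identity
f_id = identity_features(x)      # identity attributes (skin colour, age, ...)
f_expr = f_face - f_id           # expression = facial features - identity features
```

Because the disentanglement is a plain subtraction in feature space, no extra decoder or adversarial branch is needed, which is why such an implicit scheme trains more cheaply than explicit disentanglement.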
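The streaming multi-task idea can likewise be illustrated as a cascade in which each head receives the shared features concatenated with the previous head's output, so predictions (and, in a trained model, gradients) flow along the AU → expression → VA hierarchy. This is a minimal sketch under assumed layer sizes and untrained random weights, not the thesis's implementation.

```python
import numpy as np

# Hedged sketch of an AU -> expression classification -> VA cascade.
rng = np.random.default_rng(1)
D_FEAT, N_AU, N_EXPR = 16, 12, 8  # illustrative sizes

W_au = rng.normal(size=(D_FEAT, N_AU))
W_expr = rng.normal(size=(D_FEAT + N_AU, N_EXPR))
W_va = rng.normal(size=(D_FEAT + N_EXPR, 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

f = rng.normal(size=(2, D_FEAT))                       # shared backbone features
au = sigmoid(f @ W_au)                                 # per-AU activation probabilities
expr = softmax(np.concatenate([f, au], -1) @ W_expr)   # expression class distribution
va = np.tanh(np.concatenate([f, expr], -1) @ W_va)     # valence/arousal in [-1, 1]
```

Feeding each head's output forward encodes the local-to-global progression: AU activations inform the discrete expression, which in turn informs the continuous valence-arousal estimate.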