An intelligent system capable of natural interaction should be able to perceive the user's emotions and give appropriate emotional responses. Automatic affect recognition aims to identify emotions by analyzing and modelling multi-modal signals such as voice, facial and body video, as well as physiological signals. In recent years, great progress has been made in continuous affect recognition. However, due to errors in the annotations of the affective dimensions, as well as other factors such as illumination, identity, and gender differences, affect recognition in the wild still requires further effort. This thesis investigates robust multi-modal continuous affect recognition methods in the wild. The main contributions are as follows.

1. A deep bidirectional long short-term memory (Deep BLSTM) based continuous affect recognition model is proposed. In addition, an alignment method based on the concordance correlation coefficient (CCC) is proposed to align the features with the affect labels. Continuous affect recognition experiments on the RECOLA dataset show that the alignment method significantly improves recognition performance. In multi-modal continuous affect recognition, the average CCC on arousal and valence reaches 0.723, which ranked first in the Audio/Visual Emotion Challenge (AVEC 2015). A minimal sketch of the CCC metric and the delay search is given after this paragraph.
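The abstract does not spell out the alignment procedure, so the sketch below only illustrates the underlying idea: compute the CCC between two signals, then shift the annotation track by candidate frame delays and keep the delay that maximizes CCC against a proxy signal (for example, a first-pass prediction). The function names and the `max_delay` bound are illustrative assumptions, not the thesis's exact algorithm.

```python
import numpy as np

def ccc(x, y):
    # Concordance correlation coefficient between two 1-D float signals.
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def best_delay(proxy, labels, max_delay=200):
    # Shift the labels back by d frames and keep the delay that maximizes
    # CCC against the proxy signal.  Assumes max_delay < len(labels).
    best_d, best_c = 0, -1.0
    for d in range(max_delay + 1):
        shifted = labels[d:]
        c = ccc(proxy[:len(shifted)], shifted)
        if c > best_c:
            best_d, best_c = d, c
    return best_d, best_c
```

Once the best delay is found, the labels are shifted accordingly before training, which is a common way to compensate for the annotators' reaction lag.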
2. A weakly supervised learning approach based on a hybrid deep neural network and bidirectional long short-term memory recurrent neural network (DNN-BLSTM) is proposed for continuous affect recognition. It first maps the audio/visual features into a more discriminative space via the powerful modelling capacity of the DNN, and then models the temporal dynamics of affect via the BLSTM. To reduce the negative impact of unreliable labels, a temporal label (TL) together with a robust loss function (RL) is proposed to incorporate weak supervision into the learning of the DNN-BLSTM model. Single-modal and multi-modal affect recognition experiments were carried out on the RECOLA dataset. The single-modal results show that the proposed method with TL and RL obtains remarkable improvements on both arousal and valence in terms of CCC, while the multi-modal results show that, with fewer feature streams, the proposed approach obtains results comparable to state-of-the-art methods, with the average CCC on the arousal and valence dimensions reaching 0.74. A sketch of such a model appears after this summary.

3. An adaptive weight network based model-level fusion approach is proposed for audiovisual continuous affect recognition. It considers the complementarity and redundancy between multiple streams from different modalities. In addition, it can efficiently incorporate side information such as gender through the adaptive weight network. Finally, a deep supervision based optimization strategy is proposed for training the audiovisual continuous affect recognition model. Experimental results on the RECOLA dataset show that the adaptive weight network improves performance compared with a plain neural network without adaptive weights. The approach obtains remarkable improvements on both arousal and valence in terms of CCC compared with state-of-the-art early fusion and model-level fusion approaches, with the average CCC on valence and arousal reaching 0.727. A fusion sketch is also given below.

4. Using an extended 3D Morphable Model (3DMM) that disentangles the identity factor from the facial expressions of a specific person, a framework is proposed for extracting three-dimensional facial spatio-temporal features from monocular image sequences. The contribution of this work is an efficient 3D scene flow based feature for continuous affect recognition. First, 2D facial landmarks are located; they are then used to fit a 3D morphable face model and obtain 3D point clouds. By computing the displacements of the 3D point clouds between successive frames, 3D scene flow based features are obtained. Such features are robust to large facial poses, identity, illumination, and background noise. An LSTM model is used to evaluate the effectiveness of the proposed 3D facial spatio-temporal features for video-based continuous affect recognition. Experiments are carried out on the RECOLA and SEMAINE datasets. On the RECOLA dataset, the average CCC on the arousal and valence dimensions reaches 0.515. Compared with state-of-the-art features, the proposed features yield more accurate continuous affect recognition results. The displacement computation is sketched below.

5. A continuous affect recognition model based on a recurrent neural network and Bayesian filtering (RNN-BF) is proposed. The model first adopts an RNN to model the complex dynamic context of the high-dimensional low-level information within image or feature sequences, and then uses a Bayesian filter to model the dynamics of the low-dimensional high-level affective state. Moreover, it embeds a Gaussian filter to automatically align the features and annotations. The entire RNN-BF model is jointly optimized with error backpropagation and gradient descent, from which the parameters of the Bayesian filter and the Gaussian filter are obtained. Experimental results on the RECOLA database show that embedding the Bayesian filter effectively improves continuous affect recognition performance, and that the Gaussian filter boosts it further, with the average CCC on arousal and valence reaching 0.579. A minimal filtering sketch closes this summary.
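The abstract names the DNN-BLSTM hybrid and a robust loss (RL) but gives neither in detail. The PyTorch sketch below is one plausible instantiation under stated assumptions: a frame-wise DNN front end, a single BLSTM layer, and a sequence-level 1 - CCC loss standing in for the robust loss; all layer sizes and the loss choice are illustrative.

```python
import torch
import torch.nn as nn

class DNNBLSTM(nn.Module):
    # Frame-wise DNN front end followed by a bidirectional LSTM.
    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.dnn = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim), nn.ReLU())
        self.blstm = nn.LSTM(hid_dim, hid_dim, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hid_dim, 1)   # one affect dimension

    def forward(self, x):                      # x: (batch, time, in_dim)
        h = self.dnn(x)
        h, _ = self.blstm(h)
        return self.out(h).squeeze(-1)         # (batch, time)

def ccc_loss(pred, gold, eps=1e-8):
    # Sequence-level 1 - CCC loss over a whole trajectory.
    pm, gm = pred.mean(), gold.mean()
    pv = ((pred - pm) ** 2).mean()
    gv = ((gold - gm) ** 2).mean()
    cov = ((pred - pm) * (gold - gm)).mean()
    return 1 - 2 * cov / (pv + gv + (pm - gm) ** 2 + eps)
```

Because 1 - CCC scores the whole trajectory rather than individual frames, a few noisy annotation frames pull on the loss less than they would under a frame-wise MSE, which is the spirit of the robust loss described above.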
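The adaptive weight network is likewise only named, not specified. A minimal sketch, assuming the network produces a per-frame convex combination of per-stream predictions and that side information (e.g. a gender code) is concatenated to its input:

```python
import torch
import torch.nn as nn

class AdaptiveWeightFusion(nn.Module):
    # Predicts per-frame fusion weights over S stream predictions,
    # optionally conditioned on side information such as a gender code.
    def __init__(self, n_streams, side_dim=0, hid=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_streams + side_dim, hid),
                                 nn.ReLU(),
                                 nn.Linear(hid, n_streams))

    def forward(self, stream_preds, side=None):
        # stream_preds: (batch, time, n_streams); side: (batch, time, side_dim)
        x = stream_preds if side is None else torch.cat([stream_preds, side], -1)
        w = torch.softmax(self.net(x), dim=-1)   # adaptive, input-dependent weights
        return (w * stream_preds).sum(-1)        # fused prediction, (batch, time)
```

A plain fusion network without adaptive weights would instead regress directly from the concatenated streams with parameters fixed across inputs and conditions, which is the baseline the thesis compares against.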
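Once the 3DMM has been fitted (the fitting itself is outside this sketch), the scene flow feature reduces to per-vertex displacements between successive frames. A minimal NumPy version, assuming the fitted point clouds are already in vertex-wise correspondence and that any rigid-motion normalization has been handled by the model fitting:

```python
import numpy as np

def scene_flow_features(points):
    # points: (T, N, 3) array of 3DMM-fitted facial point clouds in
    # vertex-wise correspondence across frames.
    # Returns (T-1, N*3) per-frame displacement (scene flow) features.
    flow = np.diff(points, axis=0)          # (T-1, N, 3) displacements
    return flow.reshape(flow.shape[0], -1)  # flatten per frame
```

The resulting sequence can be fed directly to an LSTM, as in the evaluation described above.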
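The abstract does not say which Bayesian filter is used; a Kalman filter with a random-walk state model is the simplest instance and illustrates the smoothing role the filter plays on top of the RNN output. In the thesis the filter parameters (and the Gaussian alignment filter) are learned jointly by backpropagation; here `q` and `r` are fixed by hand and the Gaussian filter is omitted.

```python
import numpy as np

def kalman_smooth(obs, q=1e-3, r=1e-1):
    # obs: 1-D float array of noisy per-frame RNN predictions.
    # q: process noise (how fast the latent affect state may drift);
    # r: observation noise (how much the RNN output is trusted).
    x, p = obs[0], 1.0            # initial state estimate and variance
    out = np.empty_like(obs)
    out[0] = x
    for t in range(1, len(obs)):
        p = p + q                 # predict: state assumed locally constant
        k = p / (p + r)           # Kalman gain
        x = x + k * (obs[t] - x)  # update with the new observation
        p = (1 - k) * p
        out[t] = x
    return out
```

Smoothing the low-dimensional affective trajectory in this way is what lets the filter clean up frame-level jitter that the RNN alone leaves in its predictions.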