
Research On Depression Recognition From Audio And Visual Cues

Posted on: 2020-06-27    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L He    Full Text: PDF
GTID: 1484306740472914    Subject: Computer Science and Technology
Abstract/Summary:
With the accelerating pace of modern work and life, depression has become a common mental disorder. There is a conflict between the high prevalence of depression and the serious shortage of medical staff. To help clinicians diagnose depression severity effectively and efficiently, researchers in the field of affective computing attempt to leverage knowledge from artificial intelligence, psychology, physiology, and cognitive science to assess the severity of depression from audiovisual and physiological information. This dissertation focuses on audiovisual features and models, proposes several methods for depression recognition, and verifies their effectiveness on the AVEC2013 and AVEC2014 depression databases. The main contributions of this dissertation are as follows:

1. To address the issues that the dimension of visual features is too high and that aggregating hand-crafted features requires empirically setting the number of Gaussian components, we propose a method that learns temporal feature representations around the facial region and builds global features from frame-level features for depression recognition from videos. Based on the Median Robust Extended Local Binary Patterns (MRELBP) feature, we compute the dynamic discriminative patterns of the center-pixel and neighbor representations of facial image sequences, and propose the Median Robust Extended Local Binary Patterns from Three Orthogonal Planes (MRELBP-TOP) feature to capture different characteristics of facial expression images. To represent a global feature vector over the frame-level MRELBP-TOP features of each sub-video clip, the Dirichlet Process Gaussian Mixture Model-Fisher Vector (DPFV) is proposed. MRELBP-TOP can learn local, global, and sequential feature representations of facial image sequences. In addition, DPFV adopts Dirichlet Process Gaussian Mixture Models (DPGMM) to automatically learn the number of Gaussian Mixture Model (GMM) components and to represent the high-level characteristic patterns of the MRELBP-TOP features. Support Vector Regression (SVR) with an intersection kernel is adopted to predict the severity of depression. The root mean square error (RMSE) between the predicted values and the BDI-II scores is 9.20 and 9.01 on the test sets of AVEC2013 and AVEC2014, respectively, lower than those of state-of-the-art video-based depression recognition methods.

2. To represent long-term temporal information from facial image sequences, we propose to adopt 3D Convolutional Neural Network (3D-CNN) and Spatial-Temporal Attention based ConvLSTM (STA-ConvLSTM) networks for depression recognition. This method uses 3D-CNN and STA-ConvLSTM to learn long-term global and high-level features from video, and performs end-to-end depression recognition. On the test sets of AVEC2013 and AVEC2014, this approach obtains RMSEs of 10.32 and 10.27, respectively, lower than those of most state-of-the-art video-based depression recognition methods.

3. To solve the problems that a large number of facial images are needed to fully train a deep model and that current methods do not consider the parallelism of pre-trained deep models, we present a deep regression network, named DepNet, for predicting depression severity from facial image sequences. Joint fine-tuning is performed in parallel on multiple 2D Convolutional Neural Networks (2D-CNNs) using a small number of facial image sequences. A Dynamics Temporal Feature Aggregation Module (DTFAM) is adopted to represent and aggregate high-level features on top of the deep-learned features. Experimental results on the test sets of AVEC2013 and AVEC2014 show that the proposed approach is promising compared with state-of-the-art deep learning-based depression recognition approaches. Compared with the 3D Convolutional Neural Network-Recurrent Neural Network (C3D-RNN) based depression recognition method, the RMSE is reduced by 1.19% and 1.53%, respectively.

4. We propose to jointly fine-tune 2D-CNNs for audiovisual multimodal depression recognition. For depression recognition from audio segments, we jointly fine-tune the 2D-CNN models of the global features, the MRELBP of the spectrogram, and the raw speech signal, and deeply fuse the patterns from the different input streams. The RMSE is 10.00 and 9.98 on the test sets of AVEC2013 and AVEC2014, respectively, lower than those of most audio-based depression recognition methods. For audiovisual multimodal depression recognition, we jointly fine-tune the 2D-CNNs of the VGG-Face features and the speech spectrogram to boost recognition performance. The RMSEs are 10.08 on the test set of AVEC2013 and 9.99 on the test set of AVEC2014, both lower than those of state-of-the-art audiovisual depression recognition methods.
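The DPFV step in contribution 1 hinges on a Dirichlet-process GMM inferring the number of mixture components from the data rather than having it set by hand. The dissertation's actual feature pipeline is not reproduced here; the sketch below only illustrates that inference on synthetic data, using scikit-learn's truncated Dirichlet-process variational GMM (all variable names and data are illustrative stand-ins, not the MRELBP-TOP features):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for frame-level descriptors: three well-separated clusters.
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(200, 4))
    for c in (-4.0, 0.0, 4.0)
])

# Truncated Dirichlet-process GMM: an upper bound of 10 components is given,
# but the stick-breaking prior drives the weights of unused components toward zero,
# so the effective number of mixtures is learned rather than fixed in advance.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
    random_state=0,
).fit(X)

# Components with non-negligible weight = the learned number of mixtures.
effective = int(np.sum(dpgmm.weights_ > 0.01))
print(effective)
```

In a DPFV-style pipeline, the fitted mixture's posterior statistics would then parameterize a Fisher-vector encoding of each sub-video clip.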
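The regressor in contribution 1 is an SVR with an intersection (histogram-intersection) kernel. scikit-learn's SVR accepts a callable kernel, so a minimal sketch looks like the following; the descriptors and scores are synthetic stand-ins, not the actual DPFV features or BDI-II labels:

```python
import numpy as np
from sklearn.svm import SVR

def intersection_kernel(A, B):
    # Histogram-intersection kernel: K(x, y) = sum_i min(x_i, y_i).
    # Broadcasting builds the full (len(A), len(B)) Gram matrix.
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=-1)

rng = np.random.default_rng(0)
X = rng.random((50, 16))                       # stand-in per-video descriptors
y = X.sum(axis=1) + rng.normal(0, 0.1, 50)     # synthetic severity scores

model = SVR(kernel=intersection_kernel).fit(X, y)
preds = model.predict(X)
```

The intersection kernel is a common choice for histogram-like features such as LBP variants, since it compares bin-wise overlap rather than Euclidean distance.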
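All four contributions are evaluated with the RMSE between predicted values and ground-truth BDI-II scores. As a reminder of the metric (the scores below are illustrative, not from the AVEC test sets):

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error between predicted and ground-truth scores."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical predictions vs. BDI-II ground truth.
print(rmse([10, 22, 5], [12, 20, 8]))  # sqrt((4 + 4 + 9) / 3) ≈ 2.38
```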
Keywords/Search Tags:Depression, DPGMM, MRELBP-TOP, DepNet, CNN, STA-ConvLSTM