
Research On Deep Learning-based Depression Recognition From Facial Expression And Speech

Posted on: 2023-08-26    Degree: Doctor    Type: Dissertation
Country: China    Candidate: W T Guo    Full Text: PDF
GTID: 1524307025959639    Subject: Computer Science and Technology
Abstract/Summary:
According to data from the World Health Organization, depression will become the most common mental disease by 2030, placing a serious burden on individuals, families, and society. However, because the ratio of doctors to patients is severely imbalanced worldwide, many patients with depressive disorders cannot obtain a timely diagnosis. At present, the diagnosis of depression relies mainly on scales and questionnaires, but these methods suffer from subjectivity, high concealment, strong dependence on experts, and a high misdiagnosis rate. Since recent studies have found that depression affects patients' facial and vocal expressions, facial expression and speech have become core behavioral indicators for depression recognition. With its powerful feature representation and fusion abilities, deep learning has achieved much in depression recognition in recent years. Nevertheless, difficulties and challenges remain. First, collecting facial expression and speech data from patients with depressive disorders is complicated by the requirements of ethics and privacy protection, so the available datasets are seriously undersized, which challenges the application of deep learning methods. Second, given the limited data samples, extracting facial expression and speech features that fully describe the characteristics of depressive disorders is a problem that needs further research. Third, because patients with depressive disorders are a particular group, data quality depends largely on the subjects' degree of cooperation; if subjects do not cooperate, it is difficult to keep the collected facial expression and speech data consistent in the time dimension, which makes audio-video multimodal fusion methods unsatisfactory.

To address these problems, the dissertation builds a Chinese localized depression dataset, including voice data, video data, depth video data, and emotional state data, and uses it to study deep learning-based depression recognition from facial expressions and speech. For facial expression-based recognition, the dissertation first proposes a method that models the facial expressions of depressive disorders by fusing 2D and 3D visual information from different data sources on small datasets. Considering the influence of long-term facial expressions on depression recognition, it also proposes a long-term facial expression modeling method that incorporates a visual attention mechanism to capture the global spatiotemporal characteristics of significant expression changes. For speech-based recognition, the dissertation combines speaker features and speech emotional features to obtain a speech representation of depressive disorders and recognizes depression with a Mixture-of-Experts model. Finally, to account for the joint impact of facial and vocal expression on depression recognition, it proposes a Transformer-based cross-modal deep learning network that learns a joint representation of the facial expression and speech modalities. The main works and contributions of the dissertation are as follows:

Firstly, to address the shortage of audio and video data for depression recognition, the dissertation proposes a depression recognition method based on two Deep Belief Network (DBN) models trained on the self-built Chinese localized depression dataset. The first DBN model extracts 2D static facial expression features from images collected by optical cameras, whereas the second extracts 3D dynamic facial expression features from 3D facial points collected by Kinect depth cameras. The final decision combines the two models through joint fine-tuning, fusing the static and dynamic features. The experimental results demonstrate that the proposed method achieves 72.14% accuracy on the self-built dataset, and that combining the 2D and 3D feature models outperforms using either model individually. Furthermore, recognition accuracy is higher under the stimulation of positive and negative emotions, and the accuracy for women is generally higher than for men. The results therefore show that the proposed method can identify patients at potential risk of depression on a small-sample dataset.
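The sketch below illustrates the two-branch fusion idea under joint fine-tuning. It is a minimal PyTorch illustration, not the dissertation's implementation: the DBN branches are shown as already-unrolled feed-forward stacks (RBM-based layer-wise pre-training is omitted), and all dimensions, class counts, and names such as `TwoBranchDBN` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DBNBranch(nn.Module):
    """A pre-trained DBN unrolled into a feed-forward stack.

    RBM pre-training is omitted; after pre-training, a DBN is
    commonly fine-tuned as an ordinary sigmoid MLP like this one.
    """
    def __init__(self, dims):
        super().__init__()
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class TwoBranchDBN(nn.Module):
    """Fuses 2D static and 3D dynamic facial features for a joint decision."""
    def __init__(self, dim_2d=1024, dim_3d=512, hidden=256, n_classes=2):
        super().__init__()
        self.branch_2d = DBNBranch([dim_2d, 512, hidden])   # optical-camera features
        self.branch_3d = DBNBranch([dim_3d, 256, hidden])   # Kinect 3D facial points
        self.classifier = nn.Linear(2 * hidden, n_classes)  # fused decision layer

    def forward(self, x2d, x3d):
        fused = torch.cat([self.branch_2d(x2d), self.branch_3d(x3d)], dim=-1)
        return self.classifier(fused)

# Joint fine-tuning: one loss drives gradients through both branches at once,
# which is what lets the static and dynamic features adapt to each other.
model = TwoBranchDBN()
logits = model(torch.randn(8, 1024), torch.randn(8, 512))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
loss.backward()
```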
Secondly, targeting the facial expression retardation caused by the cognitive bias of depressive disorders, the dissertation proposes a long-term facial expression modeling method that fuses a visual attention mechanism to capture the features of significant expression changes. Built on a 3D convolutional residual network, the method uses global average pooling and maximum pooling features to compute attention along the temporal, channel, and spatial dimensions, obtains a combined temporal-channel-spatial attention map, and selectively embeds it into the network (a sketch of such a module is given after the next paragraph). At the same time, a convolutional Long Short-Term Memory (LSTM) variant is embedded into the 3D convolutional residual network to capture the global spatiotemporal features of significant expression changes. The experimental results show that the proposed method achieves 78.60% accuracy on the Chinese localized depression dataset and a mean absolute error (MAE) of 5.68 on the AVEC2014 dataset, outperforming state-of-the-art methods. The proposed temporal-channel-spatial attention module learns the features essential for depression recognition.

Thirdly, because personality and emotional characteristics influence the utterances of depressive disorders to varying degrees, the dissertation proposes a depression recognition method based on a Mixture-of-Experts (MoE) model that combines speakers' personal features and emotional features (see the MoE sketch below). First, the dissertation pre-trains a Time Delay Neural Network (TDNN)-based speaker feature extractor on a large-scale speaker recognition dataset and a ResNet-based speech emotional feature extractor on a large-scale emotional speech corpus. Then, after merging the extracted personal and emotional features of depressive disorders with speech depression datasets, it trains the MoE model with a multi-source domain adaptation algorithm. The experimental results show that the proposed method achieves 74.3% accuracy on the self-built Chinese localized depression dataset and an MAE of 6.32 on the AVEC2014 dataset, outperforming state-of-the-art deep learning methods that use speech features. Furthermore, the accuracy of the proposed method is higher in both voice question-answering tasks and reading-aloud tasks. Therefore, the proposed method can effectively recognize depression from speech.
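The following sketch shows one plausible form of the temporal-channel-spatial attention described above, in the style of CBAM extended from images to video tensors. It is a hedged illustration, not the dissertation's module: the reduction ratio, kernel sizes, and the class name `TemporalChannelSpatialAttention` are assumptions; only the use of average- and max-pooled descriptors along the time, channel, and space axes follows the text.

```python
import torch
import torch.nn as nn

class TemporalChannelSpatialAttention(nn.Module):
    """CBAM-style attention over 3D video features of shape (N, C, T, H, W).

    Attention maps are computed from global average- and max-pooled
    descriptors along the channel, temporal, and spatial axes, then
    applied multiplicatively to the feature volume.
    """
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.temporal_conv = nn.Conv1d(2, 1, kernel_size=3, padding=1)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                        # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        # Channel attention from pooled (N, C) descriptors.
        avg_c = x.mean(dim=(2, 3, 4))
        max_c = x.amax(dim=(2, 3, 4))
        ca = torch.sigmoid(self.channel_mlp(avg_c) + self.channel_mlp(max_c))
        x = x * ca.view(n, c, 1, 1, 1)
        # Temporal attention from pooled (N, T) descriptors.
        avg_t = x.mean(dim=(1, 3, 4))
        max_t = x.amax(dim=(1, 3, 4))
        ta = torch.sigmoid(self.temporal_conv(torch.stack([avg_t, max_t], dim=1)))
        x = x * ta.view(n, 1, t, 1, 1)
        # Spatial attention from pooled (N, H, W) descriptors.
        avg_s = x.mean(dim=(1, 2))
        max_s = x.amax(dim=(1, 2))
        sa = torch.sigmoid(self.spatial_conv(torch.stack([avg_s, max_s], dim=1)))
        return x * sa.view(n, 1, 1, h, w)

attn = TemporalChannelSpatialAttention(channels=64)
out = attn(torch.randn(2, 64, 16, 28, 28))   # output keeps the input shape
```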
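For the speech branch, the next sketch shows a generic Mixture-of-Experts head over concatenated speaker and emotion embeddings. It is a minimal sketch under stated assumptions: the TDNN speaker extractor and ResNet emotion extractor are taken as pre-trained black boxes whose outputs are stand-in vectors, the multi-source domain adaptation training is omitted, and the expert count, dimensions, and name `SpeechMoE` are illustrative.

```python
import torch
import torch.nn as nn

class SpeechMoE(nn.Module):
    """Mixture-of-Experts head over speaker + emotion embeddings.

    A gating network produces a softmax over experts; the prediction
    is the gate-weighted sum of the expert outputs.
    """
    def __init__(self, spk_dim=512, emo_dim=256, n_experts=4, n_classes=2):
        super().__init__()
        in_dim = spk_dim + emo_dim
        self.gate = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                          nn.Linear(128, n_classes))
            for _ in range(n_experts))

    def forward(self, spk_feat, emo_feat):
        x = torch.cat([spk_feat, emo_feat], dim=-1)       # merged representation
        weights = torch.softmax(self.gate(x), dim=-1)     # (N, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)  # gate-weighted mixture

moe = SpeechMoE()
logits = moe(torch.randn(8, 512), torch.randn(8, 256))   # (8, 2) class logits
```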
Finally, to address the difficulty of keeping the facial expressions and speech of depressive disorders consistent in the time dimension, the dissertation proposes a depression recognition method based on self-attention-based cross-modal coding. Drawing on the self-attention encoder of the Transformer, the method designs a guided-attention unit and a self-attention unit that cooperatively learn both the cross-modal representation of speech and facial expression and the information unique to each single modality. Five co-attention modules are built, and two cascading methods, stack concatenation and encoding-decoding concatenation, are used to construct a co-attention network for depression recognition. The experimental results show that the proposed method achieves 83.9% accuracy on the self-built Chinese localized depression dataset, surpassing recognition from facial expressions or speech individually, and an MAE of 5.38 on the AVEC2014 dataset, which is better than state-of-the-art methods. The results further show that the audio- or video-modality self-attention unit can highlight the features of a single modality within the co-attention network, while the cross-modal mutually guided attention unit can learn the relationship between speech and facial expression features. Meanwhile, the self-attention features learned in later stages are better than those learned earlier; because guiding the other modality with better features yields better learning, the encoding-decoding cascading method outperforms the stack cascading method.
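As a rough picture of how a guided-attention unit can pair misaligned audio and video streams, the sketch below combines a self-attention step with a cross-attention step in the standard Transformer idiom. It is an assumption-laden illustration, not the dissertation's five-module co-attention network: the dimensions, head count, and the name `GuidedAttentionUnit` are hypothetical, and only the self-attention-plus-guided-attention structure follows the text.

```python
import torch
import torch.nn as nn

class GuidedAttentionUnit(nn.Module):
    """Self-attention on the target modality, then guided (cross)
    attention with the other modality supplying keys and values."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, guide):
        # Self-attention preserves the modality's own salient information.
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # Guided attention: queries from x, keys/values from the other modality,
        # so audio and video tokens need not be aligned frame-for-frame in time.
        x = self.norm2(x + self.cross_attn(x, guide, guide, need_weights=False)[0])
        return self.norm3(x + self.ffn(x))

unit = GuidedAttentionUnit()
video = torch.randn(2, 64, 256)      # 64 video-frame tokens
audio = torch.randn(2, 100, 256)     # 100 audio-frame tokens, different length
video_enriched = unit(video, audio)  # video representation guided by audio
```

Because the cross-attention lets each video token attend to all audio tokens (and vice versa when the roles are swapped), sequences of different lengths can be fused without forcing temporal synchronization, which is the motivation stated above.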
Keywords/Search Tags:depression recognition, cross-modal depression recognition, visual attention, facial expression recognition, speech emotion recognition