Emotion is a major marker for understanding the intention behind a discourse, and speech emotion recognition technology can effectively improve users' efficiency in human-computer interaction systems in fields such as mental health analysis, intelligent robots, and driving assistance. Emotions are complex, and extracting features associated with specific emotions is one of the important problems in speech emotion recognition research. Meanwhile, multimodal systems recognize speakers' emotions more effectively than unimodal ones. This thesis studies both aspects and designs a speech emotion recognition system.

First, this thesis designs a general framework for a speech emotion recognition system, expounds the basic theories and recognition models of speech emotion recognition, and analyzes related research methods for emotion recognition. By summarizing the problems in current research, feature extraction and modal fusion are identified as the research emphases of this thesis.

Secondly, to address the low recognition accuracy caused by interference such as redundancy and irrelevant components in speech features, this thesis proposes an Attention-based Three-Dimensional Convolutional Recurrent Neural Network (3DACRNN) method for speech emotion feature extraction. Log-mel spectrograms are fed into a Residual-Network-based Three-Dimensional Attentional Convolutional Neural Network (3DRACNN) designed to extract speech emotion features, and the temporal information is then modeled by a Bidirectional Gated Recurrent Unit (Bi-GRU). Comparative experiments demonstrate that the 3DACRNN can extract effective emotional information and improve the accuracy of emotion recognition.

Then, to address the problem that unimodal information cannot accurately and comprehensively identify a speaker's emotional state, this thesis proposes an Attention-based Convolutional Neural Network Bi-directional Gated Recurrent Unit model Fusing Visual Information (VACRNN) for speech emotion recognition, in which facial expressions in videos are used to interpret speech emotions and improve system performance. A CNN followed by a series of Gated Recurrent Units with attention mechanisms (AGRUs) extracts discriminative features characterizing facial appearance and geometric shape changes. These are fused sequentially with the speech features obtained from the pre-trained 3DRACNN through a Bi-GRU fusion network and feature concatenation, which considers contextual information comprehensively and, while retaining the information differences between modalities, yields emotion features for classification. The experimental results demonstrate that the recognition accuracy of the proposed method improves by 3.68% and 4.59% over methods in the literature on the corresponding datasets, effectively improving the accuracy and robustness of speech emotion recognition.

Finally, this thesis applies the proposed feature extraction method and the VACRNN-based speech emotion recognition model to the designed speech emotion recognition system and conducts experiments on the CH-SIMS and self-made datasets. The experimental results demonstrate the effectiveness of the developed system and show improved accuracy and robustness of speech emotion recognition.
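As a reference for the input representation, the following is a minimal sketch of log-mel spectrogram extraction using librosa. The sampling rate, FFT size, hop length, and mel-band count are illustrative assumptions; the abstract does not specify the thesis's exact settings.

# Minimal sketch of log-mel spectrogram extraction for model input.
# All frame and mel parameters below are illustrative assumptions.
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=16000, n_fft=400, hop_length=160, n_mels=64):
    """Load a waveform and return a log-scaled mel spectrogram (mels x frames)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)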
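The following PyTorch sketch illustrates the kind of pipeline the 3DACRNN description implies: a 3D convolutional block with a residual shortcut, a channel-attention gate standing in for the attention module, and a Bi-GRU over the resulting frame sequence. All layer sizes, the exact attention form, and the pooling choices are assumptions for illustration, not the thesis's exact architecture.

# Illustrative 3DACRNN-style pipeline: 3D residual convolution with a
# channel-attention gate, followed by a Bi-GRU for temporal modeling.
# Layer sizes are assumptions, not the thesis's configuration.
import torch
import torch.nn as nn

class Residual3DBlock(nn.Module):
    """3D conv block with an identity shortcut (residual learning)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)          # shortcut connection

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gate; stands in for the attention module."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (B, C, T, mels, segs)
        w = self.fc(x.mean(dim=(2, 3, 4))) # per-channel attention weights
        return x * w.view(x.size(0), -1, 1, 1, 1)

class ACRNN3D(nn.Module):
    def __init__(self, n_classes=4, channels=16, hidden=128):
        super().__init__()
        self.stem = nn.Conv3d(1, channels, kernel_size=3, padding=1)
        self.res = Residual3DBlock(channels)
        self.att = ChannelAttention(channels)
        self.gru = nn.GRU(input_size=channels, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (B, 1, T, mels, segs)
        h = self.att(self.res(self.stem(x)))
        h = h.mean(dim=(3, 4)).transpose(1, 2)  # -> (B, T, C) frame sequence
        out, _ = self.gru(h)               # Bi-GRU models temporal context
        return self.head(out.mean(dim=1))  # utterance-level logits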
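Similarly, the VACRNN fusion stage can be sketched as follows: per-frame visual features (assumed to come from the CNN and AGRU front end) are concatenated with speech features from the pre-trained extractor, and a Bi-GRU fuses the joint sequence before attention pooling and classification. The feature dimensions, the simple additive attention, and the assumption that the two streams are frame-aligned are all illustrative.

# Illustrative VACRNN-style fusion: concatenate per-frame visual and speech
# features, fuse with a Bi-GRU, pool with attention, then classify.
# Dimensions and the additive attention are assumptions for illustration.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Additive attention over time steps, producing a weighted summary."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                      # h: (B, T, D)
        a = torch.softmax(self.score(h), dim=1)
        return (a * h).sum(dim=1)              # (B, D)

class FusionVACRNN(nn.Module):
    def __init__(self, d_visual=256, d_speech=256, hidden=128, n_classes=4):
        super().__init__()
        # The Bi-GRU runs over the concatenated per-frame modality features,
        # so contextual information from both streams is modeled jointly
        # while their information differences are retained in the input.
        self.fusion_gru = nn.GRU(d_visual + d_speech, hidden,
                                 batch_first=True, bidirectional=True)
        self.pool = AttentionPool(2 * hidden)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, visual_seq, speech_seq):
        # visual_seq: (B, T, d_visual); speech_seq: (B, T, d_speech),
        # assumed aligned to a common frame rate.
        joint = torch.cat([visual_seq, speech_seq], dim=-1)
        fused, _ = self.fusion_gru(joint)
        return self.head(self.pool(fused))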