Speech is an important medium of interpersonal communication: it conveys not only ideas but also the emotional state of the speaker. If a machine can accurately grasp the speaker's emotional state and respond accordingly, it will be a great step forward for human-computer intelligent interaction. To advance this goal, researchers have devoted considerable effort to speech emotion recognition, focusing mainly on the selection of feature parameters and the construction of classification models, since the quality of the emotional features and the recognition model directly determines the final recognition performance. In recent years, deep learning has developed rapidly and has been applied in many fields. To learn strongly discriminative emotional features and improve recognition performance, this thesis draws on deep learning and related techniques to propose a convolutional recurrent neural network speech emotion recognition model with an improved loss function, and a feature fusion model based on an attention mechanism. The specific work is as follows:

(1) To capture emotion-related segments in speech and improve recognition performance, this thesis introduces a self-attention mechanism into the Convolutional Gated Recurrent Network (CGRU) model and proposes a Convolutional Gated Recurrent Network based on the self-attention mechanism (SA-CGRU). First, deep features are extracted by a convolutional neural network; the extracted features are then fed along the time dimension into a Bidirectional Gated Recurrent Network (BiGRU) to obtain the time-series features of the speech signal. Through its weight matrix, the self-attention mechanism can automatically discover correlations between features at different times. Therefore, the output of the BiGRU layer is passed into a self-attention module to strengthen the model's ability to learn emotion-related segments in speech, thereby improving emotion recognition (a minimal architecture sketch is given below).

(2) To capture correlations between features and raise the intra-class correlation of the learned emotional features, the thesis introduces the concordance correlation coefficient, proposes a concordance correlation loss function, and builds an SA-CGRU model on this improved loss. The Concordance Correlation Coefficient (CCC) is commonly used to measure the correlation between two vectors: the higher the intra-class CCC, the more similar the features within a class. The thesis proposes a new loss function based on the CCC, called CCC-Loss, and combines it with the cross-entropy loss to form the final objective. Minimizing this joint loss during training improves the model's recognition results (one plausible formulation is sketched below).

(3) To make full use of the emotional information carried by different features, this thesis combines low-level descriptors (LLDs) and spectrograms and proposes a dual-channel SA-CGRU speech emotion recognition model. LLD features and spectrograms are different representations of the speech signal, and there is complementarity between them. Deep features are extracted from both feature types with separate SA-CGRU networks, and their strengths are then combined through decision-level fusion or feature-level fusion to obtain a stable and high recognition rate (both strategies are sketched below).
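To make the SA-CGRU pipeline of (1) concrete, the following is a minimal PyTorch sketch. The abstract specifies only the CNN, then BiGRU over time, then self-attention ordering; the layer counts, channel widths, hidden size, pooling choices, and the use of a single-head nn.MultiheadAttention as the self-attention module are illustrative assumptions, not the thesis configuration.

```python
import torch
import torch.nn as nn

class SACGRU(nn.Module):
    """Illustrative SA-CGRU: CNN -> BiGRU -> self-attention -> classifier.
    All layer sizes are assumptions, not the thesis configuration."""
    def __init__(self, n_mels=64, hidden=128, n_classes=6):
        super().__init__()
        # Convolutional front end extracts local time-frequency features.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # pool the frequency axis, keep time intact
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        feat_dim = 64 * (n_mels // 4)
        # Bidirectional GRU models the temporal evolution of the features.
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        # Self-attention relates frames to each other via learned weights.
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                          batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, 1, n_mels, time)
        h = self.conv(x)                         # (batch, 64, n_mels//4, time)
        h = h.permute(0, 3, 1, 2).flatten(2)     # (batch, time, feat_dim)
        h, _ = self.bigru(h)                     # (batch, time, 2*hidden)
        h, _ = self.attn(h, h, h)                # weight emotion-relevant frames
        h = h.mean(dim=1)                        # pool over time
        return self.fc(h)                        # class logits
```

A spectrogram batch of shape (batch, 1, n_mels, time) flows through unchanged in the time dimension, so the BiGRU and attention operate frame by frame, which matches the abstract's description of feeding features along the time dimension.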
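The abstract does not give the exact form of CCC-Loss. One plausible reading, sketched below, computes the CCC between the deep feature vectors of samples that share a class label and penalizes 1 - CCC, then adds this term to cross-entropy with a weighting factor. The pairing scheme and the weight lam are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def ccc(x, y, eps=1e-8):
    """Concordance correlation coefficient between two 1-D tensors."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(unbiased=False), y.var(unbiased=False)
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2 + eps)

def ccc_loss(features, labels):
    """Average (1 - CCC) over all same-class feature pairs in the batch.
    This pairing scheme is an assumption; the thesis states only that
    CCC-Loss raises intra-class feature correlation."""
    loss, n_pairs = features.new_zeros(()), 0
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        for i in range(len(idx)):                 # O(n^2) loop; fine for a sketch
            for j in range(i + 1, len(idx)):
                loss = loss + (1 - ccc(features[idx[i]], features[idx[j]]))
                n_pairs += 1
    return loss / max(n_pairs, 1)

def total_loss(logits, features, labels, lam=0.1):
    """Joint objective: cross-entropy plus weighted CCC-Loss (lam is illustrative)."""
    return F.cross_entropy(logits, labels) + lam * ccc_loss(features, labels)
```

Driving CCC-Loss down pushes same-class features toward higher mutual concordance, which is the intra-class similarity effect the abstract attributes to the improved loss.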
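For the dual-channel model of (3), the sketch below shows both fusion strategies under the assumption that each branch is an SA-CGRU modified to return its pooled deep features together with class logits; the dimensions and the branch interface are placeholders, not the thesis design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualChannelSACGRU(nn.Module):
    """Illustrative dual-channel model: one SA-CGRU branch per feature type.
    Assumes each branch returns (pooled_features, logits); feat_dim and
    n_classes are placeholders."""
    def __init__(self, spec_branch, lld_branch, feat_dim=256, n_classes=6,
                 fusion="feature"):
        super().__init__()
        self.spec_branch = spec_branch   # SA-CGRU over spectrograms
        self.lld_branch = lld_branch     # SA-CGRU over LLD sequences
        self.fusion = fusion
        # Feature-level fusion: classify the concatenated deep features.
        self.fc = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, spec, lld):
        f1, logits1 = self.spec_branch(spec)
        f2, logits2 = self.lld_branch(lld)
        if self.fusion == "feature":
            return self.fc(torch.cat([f1, f2], dim=-1))
        # Decision-level fusion: average the two class posteriors.
        return (F.softmax(logits1, dim=-1) + F.softmax(logits2, dim=-1)) / 2
```

Feature-level fusion lets the classifier weigh the complementary evidence jointly, while decision-level fusion keeps the channels independent and only merges their posteriors; the abstract presents both as options for combining the two representations.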
To verify the effectiveness of the proposed methods, the thesis conducts comparative experiments on the EMODB and CASIA datasets. The experimental results show that the proposed methods outperform the comparison methods, achieving recognition rates of 92.90% on EMODB and 90.58% on CASIA.