With the rapid development of artificial intelligence, speech emotion recognition has become a research hotspot. In speech emotion recognition, the discriminative power of the emotion features and the robustness of the classifier model are the main factors affecting system performance. To improve the performance of speech emotion recognition systems, this paper proposes a speech emotion recognition method based on multi-scale feature fusion and multi-task learning, and a speech emotion recognition method based on a decision-tree CNN and multi-task learning. The research work is as follows:

(1) First, the research background and current state of speech emotion recognition are introduced. Then, the basic architecture of a speech emotion recognition system is summarized, and the knowledge relevant to each module of the system is presented, including speech emotion databases, speech signal preprocessing, and commonly used emotion features. Finally, the calculation of global statistics, feature preprocessing techniques, and deep learning theory are introduced, laying the foundation for the subsequent research.

(2) The emotional features extracted at different frame lengths differ, so fusing multi-scale features can make full use of their diversity. To mine the emotional information in speech signals more comprehensively, this paper proposes a speech emotion recognition method based on multi-scale feature fusion and multi-task learning. First, the speech signal is preprocessed by pre-emphasis, framing, windowing, and endpoint detection. Then, MFCCs and their first-order delta features, energy, pitch frequency, and short-time zero-crossing rate are extracted at frame lengths of different scales. Next, statistics of these features are computed at each frame length to represent the global statistical
characteristics of speech emotion. Finally, the statistical features from the multiple frame-length scales are fused. In addition, a convolutional neural network (CNN) model based on a multi-task learning strategy is constructed, with speech gender classification added to the CNN model as an auxiliary task. Under the multi-task learning strategy, the main task of speech emotion classification learns features conducive to emotion classification, improving the generalization ability of the model. Experimental results on the EMO-DB and CASIA emotion databases show that the method based on multi-scale feature fusion and multi-task learning effectively improves speech emotion recognition performance.

(3) In the method based on multi-scale feature fusion and multi-task learning, the recognition rate is largely limited by certain easily confused emotions. To address this problem, this paper proposes a speech emotion recognition method based on a decision-tree CNN and multi-task learning. The constructed decision tree model divides the emotions from coarse to fine, further improving the recognition of confusable emotions. First, the degree of confusion between emotions is computed from the emotion confusion matrix. Then, a decision tree model is constructed according to these confusion degrees, dividing the emotions into groups. Finally, a CNN model is built for each emotion group, and the multi-task learning hyperparameters in each CNN model are optimized. Experiments on the CASIA emotion database show that the method based on a decision-tree CNN and multi-task learning achieves a significantly higher recognition rate than the method based on multi-scale feature fusion and multi-task learning.
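The multi-scale global-statistics representation described in (2) can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the frame lengths (25/50/100 ms), the Hamming window, the 50% hop, and the two stand-in frame features (short-time energy and zero-crossing rate) are assumptions for illustration, and the MFCC, delta, and pitch features used in the thesis are omitted. The key point it shows is that each frame-length scale yields a fixed-length vector of global statistics, and the per-scale vectors are concatenated.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping Hamming-windowed frames."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def frame_features(frames):
    """Two illustrative per-frame features: short-time energy and zero-crossing rate."""
    energy = np.sum(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([energy, zcr], axis=1)          # shape: (n_frames, 2)

def global_stats(feats):
    """Collapse the frame axis into global statistics (mean, std, max, min)."""
    return np.concatenate([feats.mean(0), feats.std(0), feats.max(0), feats.min(0)])

def multi_scale_vector(x, sr, frame_lens_ms=(25, 50, 100)):
    """Extract global statistics at several frame-length scales and fuse them."""
    vecs = []
    for ms in frame_lens_ms:
        frame_len = int(sr * ms / 1000)
        frames = frame_signal(x, frame_len, frame_len // 2)   # 50% overlap
        vecs.append(global_stats(frame_features(frames)))
    return np.concatenate(vecs)   # 3 scales x 2 features x 4 stats = 24 values here
```

With real emotion features (MFCCs plus deltas, pitch, energy, zero-crossing rate) in place of the two stand-ins, the same structure produces the fused multi-scale statistical vector that feeds the CNN.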
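The first two steps of (3), computing pairwise confusion degrees from the confusion matrix and grouping mutually confusable emotions, can be sketched as follows. This is a simplified stand-in, not the thesis algorithm: symmetrizing the row-normalized confusion matrix as the "confusion degree", the 0.10 grouping threshold, and the union-find merging are all assumptions for illustration, and building the decision tree over the resulting groups is not shown.

```python
import numpy as np

def confusion_degree(cm):
    """Row-normalize the confusion matrix to rates, then symmetrize:
    degree[i, j] is the average of P(predict j | true i) and P(predict i | true j)."""
    rates = cm / cm.sum(axis=1, keepdims=True)
    return (rates + rates.T) / 2.0

def group_emotions(cm, labels, threshold=0.10):
    """Merge emotions whose pairwise confusion degree exceeds the threshold
    into groups, using a small union-find over the label indices."""
    deg = confusion_degree(np.asarray(cm, dtype=float))
    n = len(labels)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if deg[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(labels[i])
    return list(groups.values())
```

For example, a 4-class confusion matrix in which "angry"/"happy" and "sad"/"neutral" are frequently confused yields two groups, each of which would then get its own CNN with its multi-task hyperparameters tuned separately.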