Speech emotion recognition (SER) is one of the key technologies for realizing human-computer interaction and has significant practical value in criminal investigation, smart nursing, intelligent customer service, and other fields, which gives SER research strong practical demand and social significance. In recent years, remarkable progress has been made in SER technology, yet many shortcomings remain. Owing to individual differences among speakers, current SER systems exhibit unavoidable variation in emotional representation across speakers in speaker-independent (non-specific speaker) scenarios. These individual differences prevent the system from obtaining more universal emotional features, thereby limiting further improvement of recognition performance. To improve the performance of SER systems in speaker-independent scenarios, this paper conducts research from three perspectives: feature extraction, feature fusion, and feature decoupling. The goal is to enhance the accuracy and stability of the SER system and its ability to eliminate non-emotional information, making it better suited to speaker-independent speech emotion recognition. The main work of this paper is as follows:

(1) A speech emotion recognition algorithm based on a dynamic convolutional recurrent neural network is proposed. To obtain more discriminative speech emotion features and improve recognition accuracy, the algorithm jointly models the global and temporal characteristics of speech. A dynamic convolutional network built on a time-frequency attention mechanism mines global dynamic emotional features, using attention along both the time and frequency axes to strengthen the representation of key emotional regions in the feature maps; a Bi-LSTM learns dynamic frame-level features and bidirectional temporal information; finally, a maximum density divergence loss aligns the features of new individuals with those of the training set, reducing the impact of individual differences on the feature distribution and enhancing the model's representational ability. Experimental results show that the proposed model improves speech emotion recognition accuracy by 0.71% to 2.16% compared with standard baseline models.

(2) A speech emotion recognition algorithm based on attention-driven fusion of multi-stream convolutional recurrent neural networks is proposed. To preserve the discriminability of emotional features while maintaining feature robustness, and thereby improve model stability, the algorithm models speech from the perspectives of multi-resolution global feature fusion and global-temporal feature fusion. First, a branch structure is introduced after each pooling layer of the convolutional neural network to retain global features at different resolutions; a fast-connection scheme is then proposed for rapid fusion of these multi-resolution global features; finally, a multi-head self-attention fusion algorithm performs adaptive weighted fusion of the global and temporal features. Experimental results show that the proposed model improves speech emotion recognition accuracy by 2.03% to 3.76% compared with advanced methods, with better system stability.
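As an illustration of the fusion step in the second algorithm, the following is a minimal PyTorch sketch of multi-head self-attention fusion of multi-resolution global features and Bi-LSTM temporal features; the module name, feature dimensions, head count, and pooling choices are hypothetical assumptions rather than the exact configuration proposed in this work.

import torch
import torch.nn as nn

class AttentionFeatureFusion(nn.Module):
    # Minimal sketch: fuse multi-resolution global features (one vector per CNN
    # branch) and pooled Bi-LSTM temporal features with multi-head self-attention.
    # Dimensions, head count, and class count are illustrative assumptions.
    def __init__(self, feat_dim=256, num_heads=4, num_classes=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=num_heads,
                                          batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, global_feats, temporal_feats):
        # global_feats:   (batch, n_streams, feat_dim), one token per resolution branch
        # temporal_feats: (batch, 1, feat_dim), pooled Bi-LSTM representation
        tokens = torch.cat([global_feats, temporal_feats], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)   # adaptive weighting across streams
        return self.classifier(fused.mean(dim=1))      # aggregate fused tokens and classify

# Illustrative usage: batch of 8 utterances, three resolution branches, 4 emotion classes.
model = AttentionFeatureFusion()
logits = model(torch.randn(8, 3, 256), torch.randn(8, 1, 256))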
(3) A speech emotion recognition algorithm based on a speaker feature decoupling network built on a self-supervised model is proposed. To further enhance the interpretability of emotional features and eliminate non-emotional information from them, the algorithm starts from the perspective of feature decoupling and disentangles the entangled high-dimensional speaker and emotion features at the feature level. In the feature extraction stage, a speech self-supervised model extracts features widely applicable to downstream tasks; then a β-attention VAE is proposed to decouple this generic speech representation during training; finally, a joint loss guided by an additive angular margin loss and R-drop regularization drives the network to learn emotional features that are invariant to the speaker. Experimental results show that the proposed model improves speech emotion recognition accuracy by 2.44% to 5.15% compared with the self-supervised baseline algorithm.
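As an illustration of the margin-based objective in the third algorithm, the following is a minimal PyTorch sketch of an additive angular margin (ArcFace-style) loss applied to emotion embeddings; the class name, scale, margin, and dimensions are hypothetical, and the sketch does not include the R-drop regularization term of the full joint loss used in this work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAngularMarginLoss(nn.Module):
    # Sketch of an additive angular margin loss on emotion embeddings;
    # scale s, margin m, and feature dimension are illustrative assumptions.
    def __init__(self, feat_dim=256, num_classes=4, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalised embeddings and class centres.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the ground-truth class, then rescale.
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)

# Illustrative usage: 8 emotion embeddings of dimension 256, 4 emotion classes.
criterion = AdditiveAngularMarginLoss()
loss = criterion(torch.randn(8, 256), torch.randint(0, 4, (8,)))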