With the rapid development of technology, people's expectations for human-computer interaction keep rising. Affective computing and human-computer interaction are closely intertwined, and emotion recognition, as the foundation of the former, has become a major research direction in recent years. In speech emotion recognition, emotional features are often extracted inadequately, so that many important features are ignored, or the extracted features contain too many irrelevant ones, leading to poor recognition results. Moreover, most datasets used for emotion recognition are recorded in clean, noise-free conditions, so the resulting models cannot extract effective emotional features in natural scenes with complex background noise. To address these problems, this paper proposes a multimodal emotion recognition model for natural contexts based on a deep residual shrinkage network, and applies the network to spectrogram feature extraction and emotion recognition in natural contexts.

To make the spectrum of the speech signal smoother and more uniform, and to extract better signal parameters, the original speech signal is first preprocessed by pre-emphasis, framing, and windowing. The preprocessed signal is then transformed by the Fourier transform, rotated and mapped, and the resulting multi-frame spectra are concatenated into a power spectrum; finally, the spectrogram is obtained through a Mel filter bank. This pipeline prepares the input for the subsequent automatic extraction of emotional features.

A natural speech emotion recognition algorithm based on a deep residual shrinkage network is then proposed. To ensure that important speech emotion features are extracted and irrelevant ones are suppressed, a deep residual shrinkage module increases the weights of the important emotional features, and a bidirectional gated recurrent unit (BiGRU) is added to further extract the temporal information of speech
emotional features while reducing the model's parameter count. The recognition accuracy reaches 86.07% on the IEMOCAP dataset and 86.03% on the CASIA dataset, and 70.57% on the MELD dataset, whose background environment is more complex. This largely resolves the low recognition rates that most speech emotion models suffer in natural contexts, although individual emotions are still recognized poorly or confused with one another.

The proposed multimodal emotion recognition method combining speech and text first adds a bidirectional gated recurrent unit (BiGRU) to the XLNet pre-training model so that the model can further mine the information in the word vectors, and adds an attention mechanism that concentrates the learning weights on the important word vectors. The BiGRU lets the model better exploit the connections between contextual semantics, making the learned word-vector features richer in meaning and more accurate. Speech and text are then fused at the decision level: the emotion recognition results of the two modalities are weighted and fused by the CatBoost algorithm, which further improves the emotion recognition accuracy. Finally, experiments on the datasets verify the feasibility of the proposed deep-residual-shrinkage-network-based algorithm for multimodal emotion recognition in natural contexts.
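The spectrogram preprocessing chain described above (pre-emphasis, framing, windowing, Fourier transform, power spectrum, Mel filtering) can be sketched in NumPy as follows. This is a minimal illustration, not the thesis implementation: the sampling rate, frame length, hop size, and filter count are assumed placeholder values, and the rotation-and-mapping step mentioned in the text is omitted.

```python
import numpy as np

def mel_spectrogram(signal, sr=16000, frame_len=400, hop=160,
                    n_fft=512, n_mels=40, pre_emph=0.97):
    """Pre-emphasis -> framing -> Hamming window -> FFT ->
    power spectrum -> Mel filter bank -> log-Mel spectrogram."""
    # Pre-emphasis flattens the spectral tilt of the raw waveform
    sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Split into overlapping frames and apply a Hamming window
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    frames = np.stack([sig[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Per-frame power spectrum via the FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular Mel filter bank, spaced uniformly on the Mel scale
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log compression stabilizes the dynamic range of the features
    return np.log(power @ fbank.T + 1e-10)
```

With one second of 16 kHz audio and the defaults above, the output is a (98, 40) log-Mel spectrogram: 98 frames by 40 Mel bands, the 2-D input expected by the convolutional front end of the recognition network.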
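The decision-level fusion step described above can be illustrated with a small sketch. In the described system the per-modality weights are learned by the CatBoost algorithm; here fixed weights stand in as assumed placeholders so the mechanics of weighted fusion are visible.

```python
import numpy as np

def fuse_decisions(p_speech, p_text, w_speech=0.4, w_text=0.6):
    """Weighted decision-level fusion of per-class emotion posteriors
    from the speech and text branches. The weights are illustrative;
    the described system learns them with CatBoost."""
    p_speech = np.asarray(p_speech, dtype=float)
    p_text = np.asarray(p_text, dtype=float)
    fused = w_speech * p_speech + w_text * p_text
    # Renormalize so the fused scores again form a distribution
    return fused / fused.sum(axis=-1, keepdims=True)

# Example: four emotion classes, the two branches disagree
p_s = [0.1, 0.6, 0.2, 0.1]   # speech-branch posteriors
p_t = [0.5, 0.2, 0.2, 0.1]   # text-branch posteriors
pred = int(np.argmax(fuse_decisions(p_s, p_t)))  # fused -> class 1
```

Because fusion happens after each branch has produced its own class posteriors, either modality can be retrained or replaced without touching the other, which is the usual motivation for decision-level rather than feature-level fusion.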