Speech emotion recognition plays a vital role in human-computer interaction: it can enhance the ability of smart devices to understand human intentions and improve the user interaction experience. How to accurately perceive a user's emotion during human-computer interaction and quickly provide feedback appropriate to that emotion has therefore become a widely discussed research topic. Previous work has mainly revolved around traditional hand-crafted, low-level, single-modal features, but with the rapid development of deep learning and the massive growth of multi-modal information, neural networks have been widely adopted and have made remarkable progress. Building on this, this paper studies audio, text and image data; while fully learning contextual representations, it captures salient emotional features and further improves the fusion of multi-modal features. The main contributions of this paper are as follows:

(1) To address the problem that the emotional features captured by a model contain a large amount of redundant information unrelated to emotion, this paper proposes a local feature learning block model based on an upgraded attention mechanism (UA-LFLB). First, the original speech is preprocessed: its log-Mel spectrogram and the spectrogram's delta and delta-delta coefficients are extracted and stacked into 3D static data, which reduces the interference of irrelevant information such as different speaking styles, captures the dynamic changes of emotion, and retains more emotional information. The 3D static features are then fed into a local feature learning block, which extracts segment-level local sequence features, and subsequently into the upgraded attention mechanism, which captures utterance-level salient features with contextual information. Finally, the resulting, more discriminative emotional features are passed to a softmax classifier, which outputs a score for each emotion. Experimental results show that, compared with the best current baseline methods, the UA-LFLB model improves the three evaluation metrics by 12%, 9.35% and 9.88%, respectively.

(2) To address the limitations of single-type features and the information conflicts that arise in traditional feature fusion, this paper proposes a multi-modal self-attention fusion feature enhancement (MM-SAF) model built on the UA-LFLB model, which jointly handles audio, text and image data. The performance of traditional feature fusion methods is analyzed and compared with that of the proposed self-attention feature fusion algorithm, and a center-loss classifier replaces the traditional softmax classifier. Experimental results show that the MM-SAF model makes features of the same emotion more concentrated and features of different emotions more dispersed, achieving recognition accuracies of 90.53%, 90.13% and 90.20% on the three evaluation metrics, respectively.

In summary, this paper focuses on feature extraction and feature fusion, aiming to learn comprehensive and salient emotion representations. Experiments are carried out on the IEMOCAP dataset with multiple evaluation metrics; compared with advanced baseline methods, the proposed models achieve significantly improved performance and excellent generalization ability.
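
As an illustration of the 3D static feature construction described in contribution (1), the following is a minimal sketch in Python, assuming the librosa library is used for audio processing; all parameter values (sampling rate, number of Mel bands, FFT and hop sizes) are illustrative assumptions rather than the settings used in the thesis.

```python
import numpy as np
import librosa

def log_mel_3d_features(wav_path, sr=16000, n_mels=128, n_fft=1024, hop_length=256):
    """Build a 3-channel (static, delta, delta-delta) log-Mel representation.

    The channel stacking mirrors the "3D static data" idea described in the
    abstract; all parameter values here are illustrative assumptions.
    """
    y, sr = librosa.load(wav_path, sr=sr)

    # Log-Mel spectrogram (static channel)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    # First- and second-order temporal derivatives (delta and delta-delta)
    delta = librosa.feature.delta(log_mel, order=1)
    delta2 = librosa.feature.delta(log_mel, order=2)

    # Stack into a (3, n_mels, frames) array, e.g. as CNN input channels
    return np.stack([log_mel, delta, delta2], axis=0)
```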
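
The multi-modal self-attention fusion of contribution (2) can be sketched as follows, assuming PyTorch and treating each modality's feature vector as one token of a short sequence; the projection dimensions, number of heads and pooling choice are assumptions for illustration only, not the MM-SAF architecture itself.

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Fuse per-modality feature vectors with self-attention.

    Minimal sketch: each modality (audio, text, image) is one token,
    projected to a shared dimension and mixed with multi-head
    self-attention before pooling. All dimensions are assumptions.
    """
    def __init__(self, dims=(256, 768, 512), d_model=256, n_heads=4):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio, text, image):
        # Each input: (batch, dim_m) -> project and stack as a 3-token sequence
        tokens = torch.stack(
            [p(x) for p, x in zip(self.proj, (audio, text, image))], dim=1
        )                                                # (batch, 3, d_model)
        mixed, _ = self.attn(tokens, tokens, tokens)     # cross-modality interaction
        fused = self.norm(tokens + mixed).mean(dim=1)    # residual + mean pooling
        return fused                                     # (batch, d_model)
```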
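
The center-loss term mentioned in contribution (2) can be sketched as below, following the common formulation that pulls each sample's feature toward a learnable center of its emotion class; how exactly the thesis couples this term with the classification objective is not detailed in the abstract, so the combination shown in the final comment is an assumption.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Minimal center-loss term: pulls each feature toward its class center."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        # One learnable center per emotion class
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # features: (batch, feat_dim), labels: (batch,) integer class indices
        centers_batch = self.centers[labels]  # center of each sample's class
        return ((features - centers_batch) ** 2).sum(dim=1).mean()

# Assumed usage: combine with a cross-entropy term on the classifier logits
# total_loss = ce_loss + lambda_center * center_loss(features, labels)
```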