With the continuous development of artificial intelligence, emotion recognition, an important branch of affective computing, has become a research hotspot. Because single-modal emotion recognition suffers from low recognition rates and poor robustness, research has gradually shifted from single-modal to multimodal emotion recognition. By introducing additional modalities, the complementary information between them can be captured, improving the final recognition result. How to effectively fuse information from different modalities is both the key to multimodal emotion recognition and its main difficulty. This thesis studies multimodal emotion recognition combining three modalities, text, speech, and video, on the basis of feature-layer fusion, and explores and improves the key techniques involved. The main contributions of this thesis are:

(1) Feature extraction methods effective for each single modality are studied. For the text modality, a bidirectional LSTM network is used to extract textual sentiment features, making effective use of the contextual semantics and word-order information of the text so that the extracted features contain important temporal information. For the speech modality, a convolutional neural network is used to extract learned speech features, and the open-source tool openSMILE is used to extract low-level features of the speech signal; the two are combined into the final speech emotion feature, making the speech representation more complete. For the video modality, a three-dimensional convolutional neural network is used to extract video emotion features. Compared with an ordinary convolutional neural network, it adds a temporal dimension, so the extracted features contain rich temporal context; in addition, facial key-point features are introduced as auxiliary features, making the extracted video emotion features richer and more effective.

(2) The fusion of multimodal emotion features is studied. The widely used direct-concatenation fusion scheme is analyzed in detail, its problems are identified, and its shortcomings are addressed; on this basis, a feature-layer fusion method based on an attention mechanism is proposed. This method lets the features of each modality learn a weight that fits the distribution of the data set and then performs weighted fusion, so that the fused features are more effective, thereby improving recognition performance.

(3) Building on the attention-based feature-layer fusion method, a feature-layer fusion method that introduces the residual idea is proposed. Instead of directly optimizing the mapping function, the network optimizes the residual, which makes the mapping more sensitive to changes in the output. This allows the network structure to be optimized more effectively, increases its expressive power, and further improves the final recognition result.

(4) The proposed fusion method, combining the attention mechanism with the residual idea, is applied to the multimodal emotion recognition task and verified experimentally on public data sets. The experimental results are analyzed and discussed, demonstrating the effectiveness of the proposed feature fusion algorithm.
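The attention-weighted fusion with a residual connection described above can be sketched as follows. This is a minimal illustration, not the thesis's exact architecture: it assumes each modality's features have already been projected to a common dimension `d`, and the attention parameter `w_att` and the use of the unweighted mean as the identity branch are illustrative simplifications.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fusion(feats, w_att):
    # feats: list of per-modality feature vectors, each of shape (d,)
    # w_att: (d,) attention parameter (hypothetical; learned in practice)
    scores = np.array([f @ w_att for f in feats])   # one score per modality
    alphas = softmax(scores)                         # modality weights, sum to 1
    fused = sum(a * f for a, f in zip(alphas, feats))
    return fused, alphas

def residual_fusion(feats, w_att):
    # Residual idea: the attention branch only learns a correction
    # on top of an identity mapping (here, the unweighted mean).
    fused, alphas = attention_fusion(feats, w_att)
    identity = np.mean(feats, axis=0)
    return identity + fused, alphas

# Example with three dummy modality vectors (text, speech, video).
rng = np.random.default_rng(0)
d = 8
text_f, speech_f, video_f = (rng.normal(size=d) for _ in range(3))
out, alphas = residual_fusion([text_f, speech_f, video_f], rng.normal(size=d))
```

The weights `alphas` play the role described in contribution (2): each modality is scaled according to how informative it is for the data set, instead of being concatenated with equal importance; the `identity + fused` step mirrors contribution (3), where the network optimizes a residual rather than the full mapping.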