
Textual-and-Visual Based Multimodal Sentiment Recognition With Spatial-temporal Correlation

Posted on: 2024-09-17    Degree: Master    Type: Thesis
Country: China    Candidate: M H Ji    Full Text: PDF
GTID: 2568307136491994    Subject: Electronic information
Abstract/Summary:
Short videos are currently the most popular medium of information dissemination, and their data composition is complex and diverse: beyond the video image stream they carry audio, text annotations, and other forms of data. Effectively exploiting these multimodal data for sentiment analysis has become a hot topic among researchers. To address this problem, this thesis proposes a multimodal sentiment analysis model that fuses video and text annotations. The specific work is described as follows.

(1) To address the difficulty of fusing temporal- and spatial-dimension features in video data, this thesis proposes a visual sentiment analysis method based on multi-head self-attention and spatiotemporal feature fusion. An improved Transformer structure fuses deep spatiotemporal features from the shallow feature maps of the image sequence. First, a CNN extracts shallow visual features from the frame sequence. Then a multi-head self-attention mechanism extracts deep spatial features; on this basis, temporal features are extracted and concatenated into spatiotemporal features, effectively capturing the deep emotional cues of the visual modality. Finally, a classification network predicts the emotional category of the video sample. Experimental results demonstrate that, compared with traditional video emotion feature extraction methods, the proposed model achieves better recognition performance.

(2) To address the inefficient extraction of semantic emotional features from text data, this thesis proposes a text sentiment analysis method based on Bi-LSTM and dual-channel information enhancement. First, the text word vectors are enhanced with a sentiment lexicon and position encoding so that they carry sentiment and position information. Two independent Bi-LSTM networks then extract features for the two channels, and a Transformer network learns the semantic correlation between the dual-channel features. Finally, a classification network predicts the sentiment category of the text corpus. Experimental results demonstrate that, compared with traditional text emotion feature extraction methods, the proposed model extracts deep semantic sentiment features more efficiently and yields more accurate predictions.

(3) To address the difficulty of fusing features across the visual and text modalities caused by their structural differences, this thesis proposes a multimodal sentiment decision analysis method based on multi-layer weight matrices. Locally weighted optimization matrices and posterior probability matrices allocate decision weights to each modality, and the resulting decision fusion model for the visual and textual modalities determines the emotional category of a data sample by analyzing its emotional value matrix. Experimental results show that, compared with single-modal sentiment analysis models and multimodal feature fusion models, the decision fusion model effectively exploits the information differences among the multimodal data and improves the analysis performance of the model.
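To make the visual pipeline in (1) concrete, the following is a minimal PyTorch-style sketch. The ResNet-18 backbone, the layer sizes, the mean pooling over frames, and the three-class output are illustrative assumptions, not the thesis's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VisualSentimentNet(nn.Module):
    def __init__(self, num_classes=3, d_model=512, n_heads=8):
        super().__init__()
        backbone = resnet18(weights=None)
        # CNN backbone for shallow per-frame features (classification head dropped)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        # Multi-head self-attention over the frame sequence -> deep spatial features
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Temporal branch over the same frame features
        self.temporal = nn.LSTM(d_model, d_model, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))          # (B*T, 512, 1, 1)
        x = x.flatten(1).view(B, T, -1)             # shallow features: (B, T, 512)
        spatial, _ = self.spatial_attn(x, x, x)     # deep spatial features
        temporal, _ = self.temporal(x)              # temporal features
        # Concatenate spatial and temporal features (mean-pooled over frames)
        fused = torch.cat([spatial.mean(1), temporal.mean(1)], dim=-1)
        return self.classifier(fused)
```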
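A similarly hedged sketch of the dual-channel text model in (2). How the sentiment lexicon is injected (here, a small embedding over discretised polarity levels), the dimensions, and the single Transformer encoder layer are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DualChannelTextNet(nn.Module):
    def __init__(self, vocab_size, num_classes=3, d_emb=128, d_hid=128, max_len=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_emb)
        # Enhancement signals: sentiment-lexicon polarity scores and position encoding
        self.lexicon_emb = nn.Embedding(5, d_emb)   # assumed 5 discretised polarity levels
        self.pos_emb = nn.Embedding(max_len, d_emb)
        # Two independent Bi-LSTM channels
        self.channel_a = nn.LSTM(d_emb, d_hid, batch_first=True, bidirectional=True)
        self.channel_b = nn.LSTM(d_emb, d_hid, batch_first=True, bidirectional=True)
        # Transformer encoder learns the semantic correlation between the channels
        layer = nn.TransformerEncoderLayer(d_model=2 * d_hid, nhead=8, batch_first=True)
        self.cross = nn.TransformerEncoder(layer, num_layers=1)
        self.classifier = nn.Linear(2 * d_hid, num_classes)

    def forward(self, tokens, lexicon_ids):         # both: (B, T) integer ids
        pos = torch.arange(tokens.size(1), device=tokens.device)
        w = self.word_emb(tokens) + self.pos_emb(pos)               # word + position channel
        s = self.word_emb(tokens) + self.lexicon_emb(lexicon_ids)   # word + sentiment channel
        a, _ = self.channel_a(w)                    # (B, T, 2*d_hid)
        b, _ = self.channel_b(s)
        # Stack the two channels along the sequence axis and model their correlation
        fused = self.cross(torch.cat([a, b], dim=1))
        return self.classifier(fused.mean(1))
```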
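For the decision fusion in (3), a small numeric sketch shows how per-modality posterior probabilities can be combined through per-class decision weights into an emotional value vector whose argmax gives the predicted class. The concrete weights and probabilities below are hypothetical, not values from the thesis.

```python
import numpy as np

def decision_fusion(p_visual, p_text, w_visual, w_text):
    # Weighted per-class combination of the two modalities' posteriors;
    # the weight rows stand in for one layer of the weight matrices in (3).
    emotional_value = w_visual * p_visual + w_text * p_text
    return int(np.argmax(emotional_value))

# Hypothetical posteriors over 3 emotion classes
p_v = np.array([0.2, 0.5, 0.3])   # visual-branch posterior probabilities
p_t = np.array([0.6, 0.3, 0.1])   # text-branch posterior probabilities
w_v = np.array([0.4, 0.6, 0.5])   # per-class weights for the visual modality
w_t = np.array([0.6, 0.4, 0.5])   # per-class weights for the text modality
print(decision_fusion(p_v, p_t, w_v, w_t))   # -> 0
```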
Keywords/Search Tags: sentiment analysis, spatial-temporal feature fusion, attention mechanism, channel fusion, multimodal fusion