
Research On Key Technologies Of Self-Attention Deep Learning Model For Video Semantic Understanding

Posted on: 2024-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: J B Huang
Full Text: PDF
GTID: 2568307181450774
Subject: Computer application technology
Abstract/Summary:
Video semantic understanding refers to the ability of machines to simulate human cognition: automatically understanding the semantic information of a video, condensing a long video into a short one, and accurately describing the video's theme and content in text. Video summarization and video description are the fundamental and key technologies underpinning fields driven by video semantic understanding, such as internet short video, intelligent video retrieval, and intelligent video analysis for criminal investigation. Video summarization automatically analyzes video content to remove redundant information and extract the key frames or key scenes that best represent the original video, generating a short video. Video description applies computer vision and natural language processing to understand video content semantically, achieving cross-modal conversion from visual to textual information and ultimately generating accurate textual descriptions of the video's theme and content, which are essential for intelligent video search and recommendation. This thesis focuses on self-attention deep models for two key video semantic understanding technologies: video summarization and video description. The main contributions are:

(1) A dynamic video summarization method based on the fusion of video semantic features. In the encoding phase, the I3D two-stream convolutional network extracts static and dynamic features from the video sequence, and a channel attention mechanism understands and fuses the semantic features of the two modalities, avoiding the information loss incurred when the encoder compresses the video sequence into a fixed-length representation. In the decoding phase, a dot-product attention mechanism and an LSTM selectively focus on important contextual information, generating higher-quality dynamic video summaries.

(2) A video description method based on an instance-aware temporal feature module and a self-attention module. Existing video encoding models are limited in extracting deep semantic information; different visual elements in a video contribute differently to sentence generation; and the baseline encoder-decoder model cannot effectively capture complex contextual correlations in the video sequence. To address these limitations, this thesis proposes a video description method that integrates instance-aware temporal features with self-attention modules. The method enhances the temporal features' awareness of instances by extracting instance-level object features from the video. Meanwhile, a bidirectional long short-term memory network and a self-attention module capture the contextual information of the video sequence, encoding it so that the decoder can generate higher-quality descriptive text.

(3) A video description method based on cross-modal semantic reconstruction. To further understand video content, bridge the semantic gap between the visual and textual feature spaces, and exploit their connection and interaction, this thesis proposes a video description method based on position masking and cross-modal semantic reconstruction. The method strengthens the model's understanding of video semantics by adding position masking in the visual space, and extracts the corresponding text features of the dataset with a text feature extraction model. A cross-modal semantic reconstruction module then reconstructs feature encodings that incorporate contextual semantic information, improving the quality of the generated descriptions.

Each chapter of this thesis reports experiments on widely used video summarization and video description datasets. Ablation studies and comparative analysis of the proposed models verify the effectiveness of the methods presented in this thesis.
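The channel-attention fusion described in contribution (1) can be sketched, at a high level, as a squeeze-and-excitation-style gate applied to the concatenated two-modality features. The following is a minimal NumPy illustration under assumed shapes and random weights; the function name, layer sizes, and initialization are illustrative assumptions, not the thesis's actual implementation:

```python
import numpy as np

def channel_attention_fuse(static_feat, dynamic_feat, w1, w2):
    """Fuse two modality features with a squeeze-and-excitation-style
    channel attention: global-average-pool over time, two small dense
    layers, a sigmoid gate, then channel-wise reweighting.
    Shapes (assumed): features are (T, C); w1 is (2C, H); w2 is (H, 2C)."""
    x = np.concatenate([static_feat, dynamic_feat], axis=1)  # (T, 2C)
    squeezed = x.mean(axis=0)                                # (2C,) pooled over time
    hidden = np.maximum(squeezed @ w1, 0.0)                  # ReLU, (H,)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))              # sigmoid, (2C,)
    return x * gate                                          # (T, 2C), channel-reweighted

rng = np.random.default_rng(0)
T, C, H = 5, 4, 8
static = rng.normal(size=(T, C))    # stand-in for I3D static-stream features
dynamic = rng.normal(size=(T, C))   # stand-in for I3D dynamic-stream features
w1 = rng.normal(size=(2 * C, H)) * 0.1
w2 = rng.normal(size=(H, 2 * C)) * 0.1
fused = channel_attention_fuse(static, dynamic, w1, w2)
print(fused.shape)  # prints (5, 8)
```

Because the gate lies in (0, 1), each channel is attenuated rather than amplified; a learned gate would let the model down-weight the less informative modality per channel.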
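The dot-product attention used in the decoding phase of contribution (1), and the self-attention module of contribution (2), follow the standard scaled dot-product form softmax(QK^T / sqrt(d)) V. Below is a self-contained NumPy sketch in which toy frame features stand in for real video encodings; the shapes and the self-attention setting (Q = K = V) are assumptions for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Returns the attended context and the attention weight matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (Tq, Tk) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(1)
T, d = 6, 4
frames = rng.normal(size=(T, d))   # toy per-frame feature vectors
# Self-attention: queries, keys, and values all come from the same sequence.
context, attn = scaled_dot_product_attention(frames, frames, frames)
print(context.shape, attn.shape)  # prints (6, 4) (6, 6)
```

Each output row is a convex combination of the frame features, so every decoding step can selectively pool contextual information from the whole sequence.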
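The position masking of contribution (3) can be illustrated as zeroing a random subset of frame positions in the visual feature space, so that semantic reconstruction must rely on the surrounding context. The function name and masking ratio below are hypothetical choices for the sketch, not values from the thesis:

```python
import numpy as np

def mask_positions(features, mask_ratio, rng):
    """Zero out a random subset of frame positions (position masking).
    features: (T, C) frame features; mask_ratio: fraction of positions masked.
    Returns the masked features and the sorted masked indices."""
    T = features.shape[0]
    n_mask = int(round(T * mask_ratio))
    idx = rng.choice(T, size=n_mask, replace=False)  # positions to hide
    masked = features.copy()
    masked[idx] = 0.0                                # hide those frames entirely
    return masked, np.sort(idx)

rng = np.random.default_rng(2)
feats = rng.normal(size=(10, 4))                 # toy visual features
masked, idx = mask_positions(feats, 0.3, rng)    # hide 30% of positions
print(masked.shape, len(idx))  # prints (10, 4) 3
```

A reconstruction objective over the masked positions then pushes the encoder to model contextual semantics rather than memorize individual frames.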
Keywords/Search Tags:Video summarization, Video description, Self-attention, Position masking, Semantic reconstruction