
Research On Video Annotation Technology Based On Multimodality

Posted on: 2021-01-28  Degree: Master  Type: Thesis
Country: China  Candidate: W Liu  Full Text: PDF
GTID: 2518306107953159  Subject: Computer technology
Abstract/Summary:
Video annotation technology analyzes video information, understands video content, and annotates videos with an accuracy approaching that of humans. As the volume of video on the Internet grows, algorithms that find the videos users are interested in are urgently needed, and such algorithms are inseparable from video annotation. Video annotation technology is therefore of great significance.

Video annotation based on visual features alone extracts features from video frames with a convolutional neural network, aggregates the features over time, and finally annotates the video. The absence of audio features, together with ignoring the varying importance of individual frames, limits the accuracy of this approach. To make video annotation more accurate, and to address these shortcomings of existing video annotation models and algorithms, this thesis proposes a multimodal video annotation technique that combines visual and audio features.

First, to extract visual features more effectively, the key frames of the video are extracted before frame features are computed, removing redundant frames; a deep convolutional neural network then extracts the visual information of each frame. An attention mechanism is added during aggregation to account for the importance of each frame to the video, and the NetAC pooling model based on this attention mechanism is proposed. For the audio track, the log-Mel spectrum of the audio is first extracted, the resulting hand-crafted audio features are processed by a deep convolutional neural network, and the processed multi-segment audio frame features are fed into a learning pool for aggregation. The visual and audio features are then fused, with a gate mechanism capturing the dependencies between features, to obtain the final video features, which are fed into a decoder to produce the final video annotation result.

Video annotation experiments with the NetAC pooling model and several other pooling models were conducted on the audio modality, the visual modality, and both modalities combined. The results verify the effectiveness of the NetAC pooling model and show that audio is an important feature of video that can effectively improve the accuracy of video annotation.
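The attention-based temporal pooling and gated fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the thesis's implementation: the dot-product scoring vector `w` and the single-layer sigmoid gate `(W, b)` are assumed parameterizations, since the exact form of the NetAC model and gate mechanism is not given in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frame_feats, w):
    # frame_feats: (T, D) per-frame CNN features; w: (D,) learnable
    # scoring vector (hypothetical parameterization of the attention)
    scores = softmax(frame_feats @ w)   # (T,) importance of each frame
    return scores @ frame_feats         # (D,) importance-weighted sum over time

def gated_fusion(visual, audio, W, b):
    # concatenate the two modalities, then let a sigmoid gate reweight
    # each dimension, modeling dependencies between the fused features
    x = np.concatenate([visual, audio])          # (Dv + Da,)
    gate = 1.0 / (1.0 + np.exp(-(W @ x + b)))    # values in (0, 1)
    return gate * x
```

In a full pipeline, the pooled visual vector and the learning-pool audio vector would be fused this way and the result passed to the decoder; here `W` and `b` would be trained jointly with the rest of the network.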
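The log-Mel spectrum used as the audio front end can likewise be sketched in plain NumPy. The frame length, hop size, and Mel-band count below are common illustrative defaults, not values taken from the thesis.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(y, sr=16000, n_fft=512, hop=256, n_mels=64, eps=1e-10):
    # frame the waveform, window each frame, and take power spectra
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] for i in range(n_frames)])
    window = np.hanning(n_fft)
    spec = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2  # (T, n_fft//2+1)

    # triangular filterbank with bands spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)

    # project onto the Mel bands and compress with a log
    return np.log(spec @ fb.T + eps)  # (T, n_mels)
```

The resulting (time, Mel-band) matrix is what a deep convolutional network would consume before the learning pool aggregates the per-segment audio features.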
Keywords/Search Tags:video annotation, multi-modality, key frame extraction, convolutional neural network, learning pool