With the continuous development of short-video platforms, the demand for video processing keeps growing. Video captioning is an effective method for understanding video content: it automatically generates a natural-language description based on a full understanding of the video, enabling intelligent analysis of video content. Video captioning can be applied to video retrieval, assistance for visually impaired people, video surveillance, and other fields. Because the task involves information from both the visual and textual modalities, mainstream video captioning methods adopt an encoder-decoder framework to bridge the semantic gap between the two modalities and improve the model's ability to describe video content. However, the sentences produced by existing video captioning models are still often inaccurate and lacking in detail.

In this paper, a Visual Semantic Enhanced Encoder is constructed, consisting of a visual-semantic embedding module and a multimodal feature fusion module. The visual-semantic embedding module mines the semantic information contained in the static and temporal features of the video, while the multimodal feature fusion module captures the high-level interactions between the two kinds of features. The two modules complement each other and help the encoder generate more powerful feature representations.

Since visual and textual information belong to different modalities and lie in different feature spaces, it is challenging to align the two kinds of features in a common space and then convert visual information into text. Therefore, this paper designs a Visual-Guided Decoder that explicitly aligns the two feature types. The decoder consists of two parts: a visual decision module and a dependency controller. The visual decision module first introduces the visual information most relevant to the word being predicted; the dependency controller then adaptively controls the contributions of visual and textual information during word generation, preventing the decoder from relying too heavily on superficial correlations between words (i.e., language priors) and producing descriptions unrelated to the video content.

Finally, experiments are conducted on two public video captioning datasets, MSVD and MSR-VTT. Compared with mainstream methods, the description sentences generated by the proposed method are more fluent and accurate, and better reflect the content of the video.
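To make the encoder description concrete, the following PyTorch sketch illustrates the two encoder modules. It assumes linear projections for the visual-semantic embedding and cross-attention for the multimodal fusion; all class names, feature dimensions, and design choices here are hypothetical illustrations, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class VisualSemanticEmbedding(nn.Module):
    """Projects static (frame-level) and temporal (clip-level) features
    into a shared semantic space (hypothetical design)."""

    def __init__(self, static_dim=2048, temporal_dim=1024, embed_dim=512):
        super().__init__()
        self.static_proj = nn.Linear(static_dim, embed_dim)
        self.temporal_proj = nn.Linear(temporal_dim, embed_dim)

    def forward(self, static_feats, temporal_feats):
        # static_feats:   (batch, n_frames, static_dim), e.g. 2D-CNN features
        # temporal_feats: (batch, n_clips,  temporal_dim), e.g. 3D-CNN features
        return (torch.relu(self.static_proj(static_feats)),
                torch.relu(self.temporal_proj(temporal_feats)))


class MultimodalFusion(nn.Module):
    """Captures high-level interaction between the two streams via
    cross-attention (one plausible form of 'feature fusion')."""

    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                batch_first=True)

    def forward(self, static_emb, temporal_emb):
        # Static embeddings attend to temporal embeddings; the attended
        # result is concatenated with the originals along the time axis.
        fused, _ = self.cross_attn(static_emb, temporal_emb, temporal_emb)
        return torch.cat([static_emb, fused], dim=1)


# Usage with random features standing in for CNN outputs.
embed = VisualSemanticEmbedding()
fuse = MultimodalFusion()
s = torch.randn(2, 20, 2048)     # 20 frames of static features
t = torch.randn(2, 8, 1024)      # 8 clips of temporal features
s_emb, t_emb = embed(s, t)
video_repr = fuse(s_emb, t_emb)  # (2, 40, 512)
```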
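Similarly, one plausible realization of the dependency controller is a learned sigmoid gate that mixes the attended visual context with the decoder's language state before each word prediction. The sketch below assumes this gating design; the DependencyController name, the dimensions, and the gate form are illustrative assumptions rather than the paper's formulation.

```python
import torch
import torch.nn as nn


class DependencyController(nn.Module):
    """Gated mixing of visual context and language state before word
    prediction (an assumed design, not the paper's exact formulation)."""

    def __init__(self, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, 1)  # scalar gate per step
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_ctx, lang_state):
        # visual_ctx: (batch, hidden_dim) most relevant visual feature, e.g.
        #             produced by an attention-based visual decision module
        # lang_state: (batch, hidden_dim) decoder hidden state (language prior)
        g = torch.sigmoid(self.gate(torch.cat([visual_ctx, lang_state], dim=-1)))
        mixed = g * visual_ctx + (1.0 - g) * lang_state  # adaptive contribution
        return self.classifier(mixed)                    # vocabulary logits


# Usage for a single decoding step with random tensors.
ctrl = DependencyController()
v = torch.randn(2, 512)
h = torch.randn(2, 512)
logits = ctrl(v, h)  # (2, 10000)
```

Under this reading, a gate value near zero lets the language state dominate the next word, while a value near one forces the prediction to depend on the visual evidence, which is the intuition behind limiting reliance on language priors.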