
Research On Video Caption Based On Deep Learning Sequence Model

Posted on: 2021-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Hao
GTID: 2428330632962846
Subject: Computer Science and Technology
Abstract/Summary:
Video captioning is a research hotspot in computer vision with a wide range of applications, including video retrieval and video understanding. The task is to describe a video clip with a short English sentence. Research in this area still has many shortcomings, and the goal of this thesis is to improve the accuracy of the sentences generated by video captioning algorithms from the following angles.

A video is a complex carrier of information in different modalities, chiefly visual and audio. Different modalities capture features along different dimensions and so complement one another: visual information covers most of the content of a video, while audio information can effectively reinforce it. Current research still concentrates on the visual modality, yet intuitively, fusing audio with visual information should improve the accuracy of caption generation. How to encode the visual information in a video efficiently, and how to fuse information from different modalities effectively, are therefore among the main challenges in video caption generation.

In recent years, deep learning built around neural networks has been applied successfully in countless fields. Deep neural networks, with their capacity for autonomous representation learning, have become the first choice for video captioning, and recent research practice confirms their superiority. This thesis improves both the overall framework and the local structure of the captioning algorithm to address the blurred hierarchy of visual information and the modality-fusion problem in video captioning. The details are as follows:

1. Current work that builds video captioning on visual information alone ignores the hierarchical character of video structure. Starting from improving the recognition of scene switches in the visual stream, this thesis proposes a caption generation network built on a scene-edge-detection encoder. During encoding, the network adaptively learns whether a scene boundary has been reached, which allows the visual encoding to carry more hierarchical structure information (a rough sketch of this gating idea follows this list). The effectiveness of the model is verified both qualitatively and quantitatively, with quite competitive results on two public datasets.

2. Current work on video captioning also ignores the effect of audio timing information on the attention mechanism during text generation. Taking the contribution of audio information to textual attention as the entry point, this thesis proposes a captioning network with an audio-visual multi-modal attention mechanism: audio and visual information jointly participate in deciding each generated word, so that audio can supplement vision (see the second sketch below). The validity of the model is analyzed qualitatively and quantitatively, with good results on public datasets.

3. Finally, this thesis proposes a captioning model based on multi-layer audio-visual cross-modal attention, which represents the video with feature encodings at different levels and granularities. Multi-layer encoders on both the audio and visual streams yield feature vectors of the video at different modalities and different levels; the model's multiple attention mechanisms then fuse a total of four kinds of features, from two modalities at two levels, improving the expressive power of the video encoding (see the third sketch below). Experiments with this model achieve improvements on several evaluation metrics on the public datasets.
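As a rough illustration of the first contribution, the sketch below implements a frame-level GRU encoder whose hidden state is softly reset whenever a learned scene-boundary gate fires, so the encoding carries segment-level structure on top of the frame sequence. This is a minimal PyTorch sketch under our own assumptions (feature dimensions, the sigmoid gate, the soft reset), not the thesis's exact formulation.

```python
import torch
import torch.nn as nn

class SceneEdgeEncoder(nn.Module):
    """Frame-level GRU whose hidden state is softly reset whenever a
    learned scene-boundary gate fires (illustrative sketch, not the
    thesis's exact scene-edge-detection encoder)."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden_dim)
        self.boundary = nn.Linear(hidden_dim, 1)  # scene-switch detector
        self.hidden_dim = hidden_dim

    def forward(self, frames):                    # frames: (T, B, feat_dim)
        T, B, _ = frames.shape
        h = frames.new_zeros(B, self.hidden_dim)
        outputs, gates = [], []
        for t in range(T):
            h = self.cell(frames[t], h)
            g = torch.sigmoid(self.boundary(h))   # P(scene edge at step t)
            outputs.append(h)
            gates.append(g)
            h = h * (1.0 - g)                     # soft reset at detected edges
        return torch.stack(outputs), torch.stack(gates)

# e.g. feats, gates = SceneEdgeEncoder()(torch.randn(30, 2, 2048))
```

The soft multiplicative reset keeps the whole encoder differentiable; a hard threshold on the gate would make the boundary decision discrete but would require a straight-through or reinforcement-style estimator to train.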
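For the second contribution, the following sketch shows one decoding step in which separate additive attention heads over the visual and the audio feature sequences produce two context vectors that jointly drive the word LSTM. The additive attention form and all dimensions are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Standard additive (Bahdanau-style) attention over a feature sequence."""
    def __init__(self, feat_dim, query_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_query = nn.Linear(query_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, query):              # feats: (B, T, D), query: (B, Q)
        scores = self.v(torch.tanh(self.w_feat(feats)
                                   + self.w_query(query).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)          # attention weights over T
        return (alpha * feats).sum(dim=1)         # context vector: (B, D)

class AVAttentionDecoder(nn.Module):
    """One decoding step: attend over the visual and audio streams
    separately, then let the fused context drive the word LSTM
    (illustrative sketch of the multi-modal attention idea)."""
    def __init__(self, vis_dim=512, aud_dim=128, embed_dim=300,
                 hidden_dim=512, vocab=10000):
        super().__init__()
        self.vis_attn = AdditiveAttention(vis_dim, hidden_dim)
        self.aud_attn = AdditiveAttention(aud_dim, hidden_dim)
        self.embed = nn.Embedding(vocab, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + vis_dim + aud_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab)

    def forward(self, word, state, vis_feats, aud_feats):
        h, c = state
        ctx = torch.cat([self.vis_attn(vis_feats, h),
                         self.aud_attn(aud_feats, h)], dim=-1)
        h, c = self.lstm(torch.cat([self.embed(word), ctx], dim=-1), (h, c))
        return self.out(h), (h, c)                # word logits, new state
```

Because each modality has its own attention head, the audio stream can shift the decoder's focus at time steps where the visual evidence is ambiguous, which is the intuition behind letting audio supplement vision.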
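For the third contribution, the sketch below stacks two encoder layers per modality, attends over the four resulting streams (two modalities at two levels), and mixes the four context vectors with a learned gate. The layer types, the dot-product attention, and the gating are all assumptions meant only to make the four-way fusion structure concrete.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(feats, query, proj):                   # feats: (B, T, D), query: (B, A)
    """Dot-product attention after projecting feats into the query space."""
    keys = proj(feats)                            # (B, T, A)
    alpha = F.softmax(keys @ query.unsqueeze(-1), dim=1)  # (B, T, 1)
    return (alpha * feats).sum(dim=1)             # context: (B, D)

class MultiLayerCrossModalFusion(nn.Module):
    """Two stacked GRU encoders per modality yield low- and high-level
    feature sequences; four attention heads produce one context each,
    and a learned gate mixes the four (illustrative sketch)."""
    def __init__(self, vis_dim=2048, aud_dim=128, hid=512, q_dim=512):
        super().__init__()
        # keep each layer's outputs as a separate feature level
        self.vis_l1 = nn.GRU(vis_dim, hid, batch_first=True)
        self.vis_l2 = nn.GRU(hid, hid, batch_first=True)
        self.aud_l1 = nn.GRU(aud_dim, hid, batch_first=True)
        self.aud_l2 = nn.GRU(hid, hid, batch_first=True)
        self.projs = nn.ModuleList(nn.Linear(hid, q_dim) for _ in range(4))
        self.gate = nn.Linear(q_dim + 4 * hid, 4) # mixing weights over streams

    def forward(self, vis, aud, query):           # vis: (B,Tv,vis_dim), aud: (B,Ta,aud_dim)
        v1, _ = self.vis_l1(vis); v2, _ = self.vis_l2(v1)
        a1, _ = self.aud_l1(aud); a2, _ = self.aud_l2(a1)
        streams = [v1, v2, a1, a2]                # 2 modalities x 2 levels
        ctxs = [attend(s, query, p) for s, p in zip(streams, self.projs)]
        w = F.softmax(self.gate(torch.cat([query] + ctxs, dim=-1)), dim=-1)
        return sum(w[:, i:i + 1] * c for i, c in enumerate(ctxs))
```

The gate lets the decoder weight low-level and high-level evidence from each modality differently per step, which is one plausible reading of fusing "four kinds of features from two modalities at different levels."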
Keywords/Search Tags: video caption, encoder-decoder, multi-modal attention, multi-layer encoder