
Research On Multi-feature And Multi-modal Video Captioning Based On Deep Learning

Posted on: 2022-11-19    Degree: Master    Type: Thesis
Country: China    Candidate: Z Chang    Full Text: PDF
GTID: 2518306743474304    Subject: Computer technology
Abstract/Summary:
With the development of deep learning and the powerful learning ability of neural networks, video captioning, a cross-modal task connecting computer vision and natural language processing, has attracted extensive attention from scholars at home and abroad and has produced increasingly strong results. The goal of video captioning is to automatically generate a textual description for a given video clip; datasets for this task typically annotate short clips. Dense video captioning is a branch of this task that analyzes longer, more complex video sequences and generates textual descriptions of the multiple events in a long video. The work of this thesis focuses on these two tasks: video captioning and dense video captioning.

For video captioning, a multi-feature fusion method based on action reasoning is proposed to improve the prediction of interactions between objects, addressing the drawback that most methods infer actions mainly from object co-occurrence. The method reasons about actions explicitly: it extracts and jointly models 2D convolutional features, 3D convolutional features, and local features of the video to better capture visual dynamics, improving action recognition and thus the quality of the generated descriptions. Extensive comparative experiments on the publicly available MSVD and MSR-VTT datasets show that the proposed model improves the description of video actions and achieves competitive scores on four metrics: BLEU-4, METEOR, CIDEr, and ROUGE-L.

For dense video captioning, a multi-modal fusion method based on event interactivity is proposed. It addresses two problems: descriptions of multiple events in the same video that lack continuity and correlation, and the failure to exploit the audio information in the video. The usual pipeline for dense video captioning first localizes the events contained in a long video and then captions each event separately, so the generated descriptions lack interaction between events, even though multiple events in the same video should be connected rather than independent. To address this problem, we put forward an event-interactivity approach that models the temporal and semantic relationships between different events in the event localization phase, producing more consistent and coherent descriptions. In addition, we extract both visual and audio features of the videos to further improve description accuracy from a multi-modal perspective. Extensive experiments on publicly available datasets yield a METEOR score of 9.64 on the ActivityNet dataset, a 31.8% improvement over the mainstream model MDVC and a performance competitive with current state-of-the-art models.
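To make the first contribution concrete, below is a minimal sketch of multi-feature fusion for captioning, assuming a PyTorch-style implementation in which projected 2D appearance features, 3D motion features, and local object features are fused by self-attention before being passed to a caption decoder. All module names and dimensions are illustrative assumptions, not the thesis code.

```python
# Minimal sketch (illustrative, not the thesis implementation): fusing 2D, 3D,
# and local object features into one video representation for a caption decoder.
import torch
import torch.nn as nn

class MultiFeatureFusion(nn.Module):
    def __init__(self, d2d=2048, d3d=1024, dloc=1024, dmodel=512):
        super().__init__()
        # Project each feature stream into a shared embedding space.
        self.proj2d = nn.Linear(d2d, dmodel)
        self.proj3d = nn.Linear(d3d, dmodel)
        self.projloc = nn.Linear(dloc, dmodel)
        # Self-attention over the concatenated streams lets the model weight
        # appearance (2D), motion (3D), and object-level (local) cues jointly.
        self.attn = nn.MultiheadAttention(dmodel, num_heads=8, batch_first=True)

    def forward(self, f2d, f3d, floc):
        # f2d: (B, T, d2d) frame features, f3d: (B, T, d3d) clip features,
        # floc: (B, N, dloc) region features from an object detector.
        tokens = torch.cat([self.proj2d(f2d),
                            self.proj3d(f3d),
                            self.projloc(floc)], dim=1)  # (B, 2T+N, dmodel)
        fused, _ = self.attn(tokens, tokens, tokens)      # attention-based fusion
        return fused                                      # input to the caption decoder

# Toy usage with random tensors standing in for extracted features.
fusion = MultiFeatureFusion()
out = fusion(torch.randn(2, 20, 2048), torch.randn(2, 20, 1024), torch.randn(2, 36, 1024))
print(out.shape)  # torch.Size([2, 76, 512])
```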
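Likewise, the event-interactivity idea can be sketched as follows, under the assumption that each localized event is represented by a pooled visual feature, a pooled audio feature, and its temporal position, and that a Transformer encoder lets events attend to one another before captioning. Names and dimensions are again hypothetical.

```python
# Minimal sketch (assumptions, not the thesis code): letting event proposals
# interact before captioning, with visual and audio streams fused per event.
import torch
import torch.nn as nn

class EventInteractionEncoder(nn.Module):
    def __init__(self, dvis=1024, daud=128, dmodel=512, nlayers=2):
        super().__init__()
        self.vis_proj = nn.Linear(dvis, dmodel)
        self.aud_proj = nn.Linear(daud, dmodel)
        # Encode each event's relative temporal position (start, end, duration).
        self.time_proj = nn.Linear(3, dmodel)
        layer = nn.TransformerEncoderLayer(d_model=dmodel, nhead=8, batch_first=True)
        # Self-attention across events models their temporal and semantic
        # relationships, so each event's caption stays consistent with the others.
        self.interact = nn.TransformerEncoder(layer, num_layers=nlayers)

    def forward(self, vis, aud, times):
        # vis: (B, E, dvis) pooled visual feature per event,
        # aud: (B, E, daud) pooled audio feature per event,
        # times: (B, E, 3) normalized (start, end, duration) per event.
        events = self.vis_proj(vis) + self.aud_proj(aud) + self.time_proj(times)
        return self.interact(events)  # (B, E, dmodel): context-aware event vectors

# Toy usage: 2 videos, 5 event proposals each.
enc = EventInteractionEncoder()
ctx = enc(torch.randn(2, 5, 1024), torch.randn(2, 5, 128), torch.rand(2, 5, 3))
print(ctx.shape)  # torch.Size([2, 5, 512])
```

Each context-aware event vector would then condition a separate caption decoder, which is how the sketch reflects the goal of generating connected rather than independent event descriptions.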
Keywords/Search Tags: Deep learning, Video captioning, Multi-feature fusion, Dense video captioning, Multi-modal fusion