
Research On Multi-feature And Multi-modal Video Captioning Based On Deep Learning

Posted on: 2022-11-19    Degree: Master    Type: Thesis
Country: China    Candidate: Z Chang    Full Text: PDF
GTID: 2518306743474304    Subject: Computer technology
Abstract/Summary:
With the development of deep learning and the powerful learning ability of neural networks, video captioning, a cross-modal task connecting computer vision and natural language processing, has attracted extensive attention from scholars at home and abroad and has produced increasingly strong results. The goal of video captioning is to automatically generate a textual description for a given video clip; datasets for this task typically annotate short clips. Dense video captioning is a branch of this task that analyzes longer, more complex video sequences and generates textual descriptions of the multiple events in a long video. The work of this thesis focuses on these two tasks: video captioning and dense video captioning.

For video captioning, a multi-feature fusion method based on action reasoning is proposed to improve the prediction of interactions between objects, addressing the drawback that most methods infer actions mainly from object co-occurrence. The method reasons about actions explicitly: it extracts and jointly models 2D convolutional features, 3D convolutional features, and local features of the video to better capture visual dynamics, improving action recognition and thus the quality of the generated descriptions. Extensive comparative experiments on the publicly available MSVD and MSR-VTT datasets show that the proposed model improves the description of video actions and achieves competitive scores on four metrics: BLEU-4, METEOR, CIDEr, and ROUGE-L.

For dense video captioning, a multi-modal fusion method based on event interactivity is proposed. It addresses two problems: descriptions of multiple events in the same video that lack continuity and correlation, and the failure to exploit the audio information in the video. The usual pipeline for dense video captioning first localizes the events contained in a long video and then captions each event separately, so the generated descriptions lack interaction between events, even though multiple events in the same video should be connected rather than independent. To address this problem, we put forward an event-interactivity approach that models the temporal and semantic relationships between different events in the event localization phase, producing more consistent and coherent descriptions. In addition, we extract both visual and audio features of the videos to further improve description accuracy from a multi-modal perspective. Extensive experiments on publicly available datasets yield a METEOR score of 9.64 on the ActivityNet dataset, a 31.8% improvement over the mainstream model MDVC and a performance competitive with current state-of-the-art models.
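To make the first contribution concrete, below is a minimal sketch of multi-feature fusion for captioning, assuming a PyTorch-style implementation in which projected 2D appearance features, 3D motion features, and local object features are fused by self-attention before being passed to a caption decoder. All module names and dimensions are illustrative assumptions, not the thesis code.

```python
# Minimal sketch (illustrative, not the thesis implementation): fusing 2D, 3D,
# and local object features into one video representation for a caption decoder.
import torch
import torch.nn as nn

class MultiFeatureFusion(nn.Module):
    def __init__(self, d2d=2048, d3d=1024, dloc=1024, dmodel=512):
        super().__init__()
        # Project each feature stream into a shared embedding space.
        self.proj2d = nn.Linear(d2d, dmodel)
        self.proj3d = nn.Linear(d3d, dmodel)
        self.projloc = nn.Linear(dloc, dmodel)
        # Self-attention over the concatenated streams lets the model weight
        # appearance (2D), motion (3D), and object-level (local) cues jointly.
        self.attn = nn.MultiheadAttention(dmodel, num_heads=8, batch_first=True)

    def forward(self, f2d, f3d, floc):
        # f2d: (B, T, d2d) frame features, f3d: (B, T, d3d) clip features,
        # floc: (B, N, dloc) region features from an object detector.
        tokens = torch.cat([self.proj2d(f2d),
                            self.proj3d(f3d),
                            self.projloc(floc)], dim=1)  # (B, 2T+N, dmodel)
        fused, _ = self.attn(tokens, tokens, tokens)      # attention-based fusion
        return fused                                      # input to the caption decoder

# Toy usage with random tensors standing in for extracted features.
fusion = MultiFeatureFusion()
out = fusion(torch.randn(2, 20, 2048), torch.randn(2, 20, 1024), torch.randn(2, 36, 1024))
print(out.shape)  # torch.Size([2, 76, 512])
```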
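Likewise, the event-interactivity idea can be sketched as follows, under the assumption that each localized event is represented by a pooled visual feature, a pooled audio feature, and its temporal position, and that a Transformer encoder lets events attend to one another before captioning. Names and dimensions are again hypothetical.

```python
# Minimal sketch (assumptions, not the thesis code): letting event proposals
# interact before captioning, with visual and audio streams fused per event.
import torch
import torch.nn as nn

class EventInteractionEncoder(nn.Module):
    def __init__(self, dvis=1024, daud=128, dmodel=512, nlayers=2):
        super().__init__()
        self.vis_proj = nn.Linear(dvis, dmodel)
        self.aud_proj = nn.Linear(daud, dmodel)
        # Encode each event's relative temporal position (start, end, duration).
        self.time_proj = nn.Linear(3, dmodel)
        layer = nn.TransformerEncoderLayer(d_model=dmodel, nhead=8, batch_first=True)
        # Self-attention across events models their temporal and semantic
        # relationships, so each event's caption stays consistent with the others.
        self.interact = nn.TransformerEncoder(layer, num_layers=nlayers)

    def forward(self, vis, aud, times):
        # vis: (B, E, dvis) pooled visual feature per event,
        # aud: (B, E, daud) pooled audio feature per event,
        # times: (B, E, 3) normalized (start, end, duration) per event.
        events = self.vis_proj(vis) + self.aud_proj(aud) + self.time_proj(times)
        return self.interact(events)  # (B, E, dmodel): context-aware event vectors

# Toy usage: 2 videos, 5 event proposals each.
enc = EventInteractionEncoder()
ctx = enc(torch.randn(2, 5, 1024), torch.randn(2, 5, 128), torch.rand(2, 5, 3))
print(ctx.shape)  # torch.Size([2, 5, 512])
```

Each context-aware event vector would then condition a separate caption decoder, which is how the sketch reflects the goal of generating connected rather than independent event descriptions.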
Keywords/Search Tags: Deep learning, Video captioning, Multi-feature fusion, Dense video captioning, Multi-modal fusion