Egocentric-video captioning aims to analyze a given egocentric video with algorithms and automatically generate sentences that describe the human activities or events depicted in the video. It can be widely applied in fields such as human-computer interaction, healthcare, and education. However, existing research on video captioning focuses mainly on third-person scenes, and research based on egocentric video is scarce. Drawing on this prior work and analyzing its shortcomings, this thesis conducts research on a sensor-augmented egocentric-video captioning dataset and, targeting problems inherent to egocentric vision such as the limited field of view and the missing body information of the camera wearer, proposes three egocentric-video captioning methods that enrich the research on this task. The main research contents of this thesis are as follows:

1. This thesis studies a video captioning method based on region features. To address the loss of detail in the global features used by mainstream methods and the difficulty of extracting temporal information from region features, this thesis constructs a Gating Region Recurrent Network, inspired by the Gated Recurrent Unit (GRU), to fully extract local spatio-temporal information from the region feature sequence, enhancing the model's ability to capture details and to focus on key regions (an illustrative sketch of this gated recurrence is given after this abstract). In addition, this thesis designs a stacked GRU motion encoder and a mutually guided multi-modal fusion mechanism that extracts motion features as auxiliary information and fuses them with the visual features of the video, yielding rich and fine-grained multi-modal features. The proposed method improves the accuracy and fluency of the generated video descriptions.

2. This thesis studies a video captioning method based on feature grouping and graph convolution. To fully mine object-level information in the video, this thesis attempts to extract object-level features and their interaction relationships without relying on external object detectors. To this end, based on the assumption that features taken from different positions of the same object are similar within the region feature sequence, this thesis aggregates semantically similar features into latent object features and assigns explicit semantic information to them through a semantic consistency mechanism. Furthermore, this thesis constructs a latent object feature enhancement module based on graph convolution to model the interaction relationships between multi-modal object features, improving the model's performance (a sketch of the grouping and graph-convolution step is given after this abstract).

3. This thesis investigates a unified multi-modal Transformer for egocentric-video captioning trained end to end. Compared with existing methods that are not trained end to end, this model can better adapt its features to the requirements of egocentric-video captioning. To improve the model's ability to recognize human actions of different durations, this thesis constructs a Swin1D Transformer as a motion encoder that extracts multi-scale temporal information from motion sensor data. Moreover, this thesis proposes a multi-branch attention module that combines two sparse attentions and one low-rank attention to reduce the redundancy in video features, improving the quality of the generated description sentences (a sketch of the multi-branch attention is given after this abstract).
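The following minimal PyTorch sketch illustrates the kind of GRU-style gated recurrence over a region feature sequence described in contribution 1. The module and variable names, tensor shapes, and the use of a single cell shared by all regions are assumptions made for illustration; they are not taken from the thesis.

```python
# Minimal sketch of a GRU-style recurrence over region feature sequences.
# Shapes: regions is (T, R, D) -- T frames, R regions per frame, D-dim features.
import torch
import torch.nn as nn


class RegionGatingCell(nn.Module):
    """One GRU-like update shared by all regions of a frame."""

    def __init__(self, dim: int):
        super().__init__()
        self.gates = nn.Linear(2 * dim, 2 * dim)   # update (z) and reset (r) gates
        self.candidate = nn.Linear(2 * dim, dim)   # candidate hidden state

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        zr = torch.sigmoid(self.gates(torch.cat([x_t, h_prev], dim=-1)))
        z, r = zr.chunk(2, dim=-1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x_t, r * h_prev], dim=-1)))
        return (1.0 - z) * h_prev + z * h_tilde    # gated blend, as in a GRU


class GatingRegionRecurrentNet(nn.Module):
    """Runs the gated update along time for every region independently."""

    def __init__(self, dim: int):
        super().__init__()
        self.cell = RegionGatingCell(dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        T, R, D = regions.shape
        h = torch.zeros(R, D, device=regions.device)
        outputs = []
        for t in range(T):
            h = self.cell(regions[t], h)           # (R, D) per-region states
            outputs.append(h)
        return torch.stack(outputs, dim=0)         # (T, R, D)


if __name__ == "__main__":
    feats = torch.randn(8, 16, 256)                # 8 frames, 16 regions, 256-d
    print(GatingRegionRecurrentNet(256)(feats).shape)   # torch.Size([8, 16, 256])
```

In a design of this kind, the update gate decides, per region and per frame, how much new local evidence is absorbed into the running state, which is one way to realize the "focus on key regions" behavior the abstract mentions.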
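Contribution 2 describes grouping semantically similar region features into latent object features and refining them with graph convolution. The sketch below shows one plausible form of that grouping plus a single graph-convolution step; the learnable object centers, the soft-assignment rule, and the affinity-based adjacency are assumptions for this example and do not reproduce the thesis's exact formulation (in particular, the semantic consistency mechanism is omitted).

```python
# Illustrative sketch: group region features into K latent object features,
# then refine them with one graph-convolution step over an affinity graph.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentObjectGrouping(nn.Module):
    def __init__(self, dim: int, num_objects: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_objects, dim))  # K learnable "object" anchors
        self.gcn_weight = nn.Linear(dim, dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (N, D) region features pooled over all frames.
        # Soft-assign each region to a latent object by feature similarity.
        assign = F.softmax(regions @ self.centers.t(), dim=-1)       # (N, K)
        objects = assign.t() @ regions                                # (K, D) weighted sums
        objects = objects / assign.sum(dim=0).clamp(min=1e-6).unsqueeze(-1)  # weighted means

        # One graph-convolution step: adjacency from pairwise object affinity.
        adj = F.softmax(objects @ objects.t(), dim=-1)                # (K, K)
        return F.relu(self.gcn_weight(adj @ objects))                 # message passing + projection


if __name__ == "__main__":
    region_feats = torch.randn(128, 256)           # 128 region features, 256-d
    obj_feats = LatentObjectGrouping(256, num_objects=8)(region_feats)
    print(obj_feats.shape)                         # torch.Size([8, 256])
```

The point of the soft assignment is that no external object detector is needed: regions that look alike are pulled into the same latent object slot, and the graph step then lets those slots exchange information.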
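Contribution 3 mentions a multi-branch attention module that combines two sparse attentions with a low-rank attention. The sketch below pairs a local-window branch and a top-k branch with a learned low-rank projection of keys and values and sums the three outputs; these concrete branch choices, the single-head formulation, and the fixed `max_len` padding are illustrative assumptions, not the thesis's exact design.

```python
# Hedged sketch of a multi-branch attention layer mixing two sparse branches
# with one low-rank branch. Assumes sequence length L <= max_len.
import torch
import torch.nn as nn
import torch.nn.functional as F


def full_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v


class MultiBranchAttention(nn.Module):
    def __init__(self, dim: int, window: int = 8, topk: int = 8, rank: int = 16, max_len: int = 256):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.window, self.topk = window, topk
        self.low_rank = nn.Linear(max_len, rank, bias=False)   # compresses the sequence axis
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (L, D) token features for one video.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        L, D = q.shape

        # Branch 1: local-window sparse attention (tokens attend only inside their window).
        pad = (-L) % self.window
        qw = F.pad(q, (0, 0, 0, pad)).view(-1, self.window, D)
        kw = F.pad(k, (0, 0, 0, pad)).view(-1, self.window, D)
        vw = F.pad(v, (0, 0, 0, pad)).view(-1, self.window, D)
        local = full_attention(qw, kw, vw).reshape(-1, D)[:L]

        # Branch 2: top-k sparse attention (keep only the k largest scores per query).
        scores = q @ k.transpose(-2, -1) / D ** 0.5
        kth = scores.topk(min(self.topk, L), dim=-1).values[..., -1:]
        sparse = F.softmax(scores.masked_fill(scores < kth, float("-inf")), dim=-1) @ v

        # Branch 3: low-rank attention (keys/values projected to `rank` landmark tokens).
        k_lr = self.low_rank(F.pad(k, (0, 0, 0, self.low_rank.in_features - L)).t()).t()
        v_lr = self.low_rank(F.pad(v, (0, 0, 0, self.low_rank.in_features - L)).t()).t()
        lowrank = full_attention(q, k_lr, v_lr)

        return self.out(local + sparse + lowrank)


if __name__ == "__main__":
    tokens = torch.randn(100, 256)
    print(MultiBranchAttention(256)(tokens).shape)   # torch.Size([100, 256])
```

The intent of combining branches like this is that the sparse paths suppress redundant video tokens while the low-rank path preserves a cheap global summary; the exact sparsity patterns used in the thesis may differ from the ones assumed here.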