Egocentric-video captioning aims to analyze a given egocentric video with algorithms and automatically generate sentences that describe the human activities or events depicted in the video. It can be widely applied in fields such as human-computer interaction, healthcare, and education. However, existing research on video captioning focuses mainly on third-person scenes, and research based on egocentric video is scarce. Drawing on this prior work and analyzing its shortcomings, this thesis conducts research on a sensor-augmented egocentric-video captioning dataset and, targeting problems inherent to egocentric vision such as the limited field of view and the missing body information of the camera wearer, proposes three egocentric-video captioning methods that enrich the research on this task. The main research contents of this thesis are as follows:

1. This thesis studies a video captioning method based on region features. To address the loss of detail in the global features used by mainstream methods and the difficulty of extracting temporal information from region features, this thesis constructs a Gating Region Recurrent Network, inspired by the Gated Recurrent Unit (GRU), to fully extract local spatio-temporal information from the region feature sequence, enhancing the model's ability to capture details and to focus on key regions (an illustrative sketch of this gated recurrence is given after this abstract). In addition, this thesis designs a stacked GRU motion encoder and a mutually guided multi-modal fusion mechanism that extracts motion features as auxiliary information and fuses them with the visual features of the video, yielding rich and fine-grained multi-modal features. The proposed method improves the accuracy and fluency of the generated video descriptions.

2. This thesis studies a video captioning method based on feature grouping and graph convolution. To fully mine object-level information in the video, this thesis attempts to extract object-level features and their interaction relationships without relying on external object detectors. To this end, based on the assumption that features taken from different positions of the same object are similar within the region feature sequence, this thesis aggregates semantically similar features into latent object features and assigns explicit semantic information to them through a semantic consistency mechanism. Furthermore, this thesis constructs a latent object feature enhancement module based on graph convolution to model the interaction relationships between multi-modal object features, improving the model's performance (a sketch of the grouping and graph-convolution step is given after this abstract).

3. This thesis investigates a unified multi-modal Transformer for egocentric-video captioning trained end to end. Compared with existing methods that are not trained end to end, this model can better adapt its features to the requirements of egocentric-video captioning. To improve the model's ability to recognize human actions of different durations, this thesis constructs a Swin1D Transformer as a motion encoder that extracts multi-scale temporal information from motion sensor data. Moreover, this thesis proposes a multi-branch attention module that combines two sparse attentions and one low-rank attention to reduce the redundancy in video features, improving the quality of the generated description sentences (a sketch of the multi-branch attention is given after this abstract).
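The following minimal PyTorch sketch illustrates the kind of GRU-style gated recurrence over a region feature sequence described in contribution 1. The module and variable names, tensor shapes, and the use of a single cell shared by all regions are assumptions made for illustration; they are not taken from the thesis.

```python
# Minimal sketch of a GRU-style recurrence over region feature sequences.
# Shapes: regions is (T, R, D) -- T frames, R regions per frame, D-dim features.
import torch
import torch.nn as nn


class RegionGatingCell(nn.Module):
    """One GRU-like update shared by all regions of a frame."""

    def __init__(self, dim: int):
        super().__init__()
        self.gates = nn.Linear(2 * dim, 2 * dim)   # update (z) and reset (r) gates
        self.candidate = nn.Linear(2 * dim, dim)   # candidate hidden state

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        zr = torch.sigmoid(self.gates(torch.cat([x_t, h_prev], dim=-1)))
        z, r = zr.chunk(2, dim=-1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x_t, r * h_prev], dim=-1)))
        return (1.0 - z) * h_prev + z * h_tilde    # gated blend, as in a GRU


class GatingRegionRecurrentNet(nn.Module):
    """Runs the gated update along time for every region independently."""

    def __init__(self, dim: int):
        super().__init__()
        self.cell = RegionGatingCell(dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        T, R, D = regions.shape
        h = torch.zeros(R, D, device=regions.device)
        outputs = []
        for t in range(T):
            h = self.cell(regions[t], h)           # (R, D) per-region states
            outputs.append(h)
        return torch.stack(outputs, dim=0)         # (T, R, D)


if __name__ == "__main__":
    feats = torch.randn(8, 16, 256)                # 8 frames, 16 regions, 256-d
    print(GatingRegionRecurrentNet(256)(feats).shape)   # torch.Size([8, 16, 256])
```

In a design of this kind, the update gate decides, per region and per frame, how much new local evidence is absorbed into the running state, which is one way to realize the "focus on key regions" behavior the abstract mentions.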
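Contribution 2 describes grouping semantically similar region features into latent object features and refining them with graph convolution. The sketch below shows one plausible form of that grouping plus a single graph-convolution step; the learnable object centers, the soft-assignment rule, and the affinity-based adjacency are assumptions for this example and do not reproduce the thesis's exact formulation (in particular, the semantic consistency mechanism is omitted).

```python
# Illustrative sketch: group region features into K latent object features,
# then refine them with one graph-convolution step over an affinity graph.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentObjectGrouping(nn.Module):
    def __init__(self, dim: int, num_objects: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_objects, dim))  # K learnable "object" anchors
        self.gcn_weight = nn.Linear(dim, dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (N, D) region features pooled over all frames.
        # Soft-assign each region to a latent object by feature similarity.
        assign = F.softmax(regions @ self.centers.t(), dim=-1)       # (N, K)
        objects = assign.t() @ regions                                # (K, D) weighted sums
        objects = objects / assign.sum(dim=0).clamp(min=1e-6).unsqueeze(-1)  # weighted means

        # One graph-convolution step: adjacency from pairwise object affinity.
        adj = F.softmax(objects @ objects.t(), dim=-1)                # (K, K)
        return F.relu(self.gcn_weight(adj @ objects))                 # message passing + projection


if __name__ == "__main__":
    region_feats = torch.randn(128, 256)           # 128 region features, 256-d
    obj_feats = LatentObjectGrouping(256, num_objects=8)(region_feats)
    print(obj_feats.shape)                         # torch.Size([8, 256])
```

The point of the soft assignment is that no external object detector is needed: regions that look alike are pulled into the same latent object slot, and the graph step then lets those slots exchange information.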
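Contribution 3 mentions a multi-branch attention module that combines two sparse attentions with a low-rank attention. The sketch below pairs a local-window branch and a top-k branch with a learned low-rank projection of keys and values and sums the three outputs; these concrete branch choices, the single-head formulation, and the fixed `max_len` padding are illustrative assumptions, not the thesis's exact design.

```python
# Hedged sketch of a multi-branch attention layer mixing two sparse branches
# with one low-rank branch. Assumes sequence length L <= max_len.
import torch
import torch.nn as nn
import torch.nn.functional as F


def full_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v


class MultiBranchAttention(nn.Module):
    def __init__(self, dim: int, window: int = 8, topk: int = 8, rank: int = 16, max_len: int = 256):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.window, self.topk = window, topk
        self.low_rank = nn.Linear(max_len, rank, bias=False)   # compresses the sequence axis
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (L, D) token features for one video.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        L, D = q.shape

        # Branch 1: local-window sparse attention (tokens attend only inside their window).
        pad = (-L) % self.window
        qw = F.pad(q, (0, 0, 0, pad)).view(-1, self.window, D)
        kw = F.pad(k, (0, 0, 0, pad)).view(-1, self.window, D)
        vw = F.pad(v, (0, 0, 0, pad)).view(-1, self.window, D)
        local = full_attention(qw, kw, vw).reshape(-1, D)[:L]

        # Branch 2: top-k sparse attention (keep only the k largest scores per query).
        scores = q @ k.transpose(-2, -1) / D ** 0.5
        kth = scores.topk(min(self.topk, L), dim=-1).values[..., -1:]
        sparse = F.softmax(scores.masked_fill(scores < kth, float("-inf")), dim=-1) @ v

        # Branch 3: low-rank attention (keys/values projected to `rank` landmark tokens).
        k_lr = self.low_rank(F.pad(k, (0, 0, 0, self.low_rank.in_features - L)).t()).t()
        v_lr = self.low_rank(F.pad(v, (0, 0, 0, self.low_rank.in_features - L)).t()).t()
        lowrank = full_attention(q, k_lr, v_lr)

        return self.out(local + sparse + lowrank)


if __name__ == "__main__":
    tokens = torch.randn(100, 256)
    print(MultiBranchAttention(256)(tokens).shape)   # torch.Size([100, 256])
```

The intent of combining branches like this is that the sparse paths suppress redundant video tokens while the low-rank path preserves a cheap global summary; the exact sparsity patterns used in the thesis may differ from the ones assumed here.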