Research And Implementation Of Video Captioning Algorithm Based On Key Frame Extraction And Cross-Modal Feature Fusion

Posted on:2021-10-08

Degree:Master

Type:Thesis

Country:China

Candidate:L H Zhang

GTID:2518306308467094

Subject:Computer technology

Abstract/Summary:

With the development of network technology, social media platforms generate huge volumes of video and text data every day, through which Internet users obtain all kinds of information about society. Short videos are currently an important source of such information, and as they grow in popularity, ensuring that their content is legal and healthy and creating a safe and positive network environment has become a social concern. At present, however, short video content screening is mainly performed by manual review. This consumes a great deal of time and manpower, and as manpower is expended, the screening results are not accurate enough. With the development of artificial intelligence, how to use deep learning to reduce labor and time costs and to screen and review video content more accurately has become a focus of research. Video content identification can be implemented with a video captioning algorithm.

Video captioning is a research hotspot in computer vision. Current video captioning algorithms have two main problems. First, traditional algorithms use equal-interval sampling to extract video features, which loses key frames that contain a large amount of semantic information and thus leads to inaccurate captions. Equal-interval sampling also produces many redundant frames, which greatly increases the computational cost of the algorithms. Second, traditional algorithms consider only temporal information when extracting features. For images and videos, however, spatial features also contain rich latent semantic information, so extracting only temporal features leads to inaccurate natural language descriptions.

To address these problems, this thesis proposes a video captioning algorithm based on key frame extraction and cross-modal feature fusion. To extract key semantic frames, a knowledge graph is adopted to obtain the key semantic information of video frames, and knowledge reasoning is used to obtain the correlations among entities in the knowledge graph. To extract the latent spatial semantic information of video frames, a spatial attention mechanism is combined with temporal features to generate accurate natural language descriptions. The proposed algorithm is evaluated on two benchmark datasets. Extensive experiments demonstrate that it achieves better video captioning performance than state-of-the-art algorithms.
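
The first problem named in the abstract, that equal-interval sampling can miss semantically important frames, can be illustrated with a small sketch. It is not the knowledge-graph method of the thesis: the function names, the synthetic `video` array, and the frame-difference heuristic used as a stand-in key frame selector are all assumptions made for illustration only.

```python
import numpy as np


def equal_interval_indices(num_frames, num_samples):
    """Pick num_samples frame indices spaced evenly across the video."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)


def difference_based_indices(frames, num_samples):
    """Pick the frames that change most relative to the previous frame.

    Hypothetical stand-in heuristic, NOT the knowledge-graph approach
    described in the abstract.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Mean absolute pixel change between each frame and its predecessor.
    diffs = np.abs(frames[1:] - frames[:-1]).reshape(len(frames) - 1, -1).mean(axis=1)
    # Frame 0 has no predecessor; give it an infinite score so it is always kept.
    scores = np.concatenate(([np.inf], diffs))
    # Highest scores first; a stable sort keeps earlier frames on ties.
    top = np.argsort(-scores, kind="stable")[:num_samples]
    return np.sort(top)


# Synthetic 21-frame "video" of 4x4 images with content changes at frames 7 and 14.
video = np.repeat(np.array([0.0, 1.0, 2.0]), 7)[:, None, None] * np.ones((1, 4, 4))

uniform = equal_interval_indices(len(video), 3)
keyed = difference_based_indices(video, 3)
print("equal-interval:", uniform.tolist())
print("difference-based:", keyed.tolist())
```

On this toy input, equal-interval sampling lands in the middle of each static segment, whereas the difference-based selector picks the frames where the content changes. A real key frame extractor would score frames by semantic content rather than by raw pixel change.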

Keywords/Search Tags:

video semantics, natural language description, knowledge graph, spatio-temporal features

Related items

1	Research On Representation And Reasoning Of Fuzzy Spatio-Temporal Knowledge Based On Description Logics
2	Research On Natural Language Description Generation For Short Video In Self Media
3	Research On Video Content Description Method Based On Multi-scale Features And Temporal Semantics
4	Research On Representation Of Fuzzy Spatio-temporal Knowledge With Ontology And Construction Method Based On Petri Net
5	Research On Surveillance Video Synopsis Based On Spatio-Temporal Slice
6	Research On Video Behavior Classification Technology Based On Spatio-Temporal Features
7	Research On Video Copy Detection Algorithm Based On Spatio-temporal CNN Features
8	Spatio-temporal Features Analysis And Its Application Based On Graph Convolutional Networks
9	Formalize The Semantics Of Natural Language Understanding Research
10	Research On Methods Of Video Content Analysis Based On Spatio-temporal Variation