Research and Implementation of Video Captioning Algorithm Based on Key Frame Extraction and Cross-Modal Feature Fusion

Posted on: 2021-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: L H Zhang
Full Text: PDF
GTID: 2518306308467094
Subject: Computer technology
Abstract/Summary:
With the development of network technology, social media platforms generate huge volumes of video and text data every day, through which Internet users can obtain all kinds of information about society. Short videos are currently an important source of social information, and as this trend grows, how to keep short-video content legal and healthy and to create a safe, positive network environment has become a social concern. At present, however, short-video content screening is performed mainly by manual review. This approach consumes a great deal of time and manpower, and as manpower is used up, the screening results become less accurate. With the development of artificial intelligence, how to use deep learning to reduce labor and time costs while screening and reviewing video content more accurately has become a focus of research.

Video content identification can be implemented with a video captioning algorithm, and video captioning is one of the research hotspots in computer vision. Current video captioning algorithms have two main problems. First, traditional algorithms extract video features by equal-interval sampling, which loses key frames that contain a large amount of semantic information and therefore leads to inaccurate captions. Equal-interval sampling also produces many redundant frames, which greatly increases the computational cost. Second, traditional algorithms consider only temporal information when extracting features. For images and videos, however, spatial features also contain rich latent semantic information, so extracting only temporal features leads to inaccurate natural language descriptions.

To address these problems, this paper proposes a video captioning algorithm based on key frame extraction and cross-modal feature fusion. To extract key semantic frames, a knowledge graph is adopted to obtain the key semantic information of video frames, and knowledge reasoning is used to obtain the correlations among entities in the knowledge graph. To extract the latent spatial semantic information of video frames, a spatial attention mechanism is combined with temporal features to generate accurate natural language descriptions. The proposed algorithm is evaluated on two benchmark datasets. Extensive experiments demonstrate that it can achieve better video captioning performance than state-of-the-art algorithms.
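
The two problems named in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis's method: the knowledge-graph key frame extraction is not reproduced here, and a simple feature-change heuristic is swapped in as a stand-in. The attention function is a generic scaled dot-product form, assumed for illustration only. All function names (`uniform_sample_indices`, `change_based_key_frames`, `spatial_attention_pool`) and array shapes are hypothetical.

```python
import numpy as np


def uniform_sample_indices(num_frames, k):
    """Equal-interval sampling: k frame indices spread evenly over the video."""
    return np.linspace(0, num_frames - 1, num=k).round().astype(int)


def change_based_key_frames(frame_feats, k):
    """Stand-in key frame selector (NOT the thesis's knowledge-graph method).

    Keeps frame 0 plus the k-1 frames whose feature vector differs most
    from the preceding frame. frame_feats has shape (T, D).
    """
    # diffs[i] is the feature change between frame i and frame i + 1.
    diffs = np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)
    top = np.argsort(diffs)[::-1][: k - 1] + 1
    return np.sort(np.concatenate(([0], top)))


def spatial_attention_pool(region_feats, query):
    """Generic spatial attention over the regions of one frame.

    region_feats: (R, D) features of R spatial regions.
    query: (D,) context vector, e.g. a temporal feature (assumed).
    Returns the attention weights (R,) and the weighted sum (D,).
    """
    scores = region_feats @ query / np.sqrt(region_feats.shape[1])
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights, weights @ region_feats
```

On a toy video whose features jump at two frames, equal-interval sampling can skip both jumps while the change-based stand-in selects them, which is the kind of key frame loss the abstract describes. The attention function shows one common way to weight spatial regions by a temporal context vector before the result is passed to a caption decoder.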
Keywords/Search Tags: video semantics, natural language description, knowledge graph, spatio-temporal features