
Research On Video Description Method Based On Deep Learning

Posted on: 2024-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: X F Chen
GTID: 2568307157982739
Subject: Master of Electronic Information (Professional Degree)
Abstract/Summary:
In the era of digital media, multimedia technology has developed rapidly, and producing and sharing video has become easier than ever. However, because of differences in culture, language, hearing, and comprehension, many viewers cannot intuitively understand video content. Video description technology emerged to meet this need: it presents video content in natural language to improve the comprehensibility and accessibility of video, making it easier for viewers to understand and digest what they watch.

Video description is a challenging task. Most current research focuses on describing the behavior of the main subject in a video and emphasizes an overall understanding of the content. This trend has led most video description work to ignore the features of small objects, so the generated language struggles to cover all of the objects present in the video, producing vague, imprecise, or even incorrect descriptions. In addition, existing work either discards temporal features outright through average pooling or relies on recurrent neural networks for only simple temporal feature extraction; both approaches retain too little of a video's long-range temporal dependencies, so the generated language is incoherent and hard to read. Describing small objects in videos precisely and preserving temporal dependencies more effectively are therefore of great significance for improving the accuracy of video descriptions.

To address these problems, this thesis studies video description methods based on deep learning. The main contributions are as follows:

(1) A novel video description framework, MSLR, is proposed. It consists mainly of a multi-scale information extraction module (MS) and a long-range temporal dependency extraction module (LR), which together build richer visual semantics by combining multi-scale information with the long-range dependencies of the video. A bidirectional GRU serves as the language model that translates visual features into natural language. Experimental results show that the framework generates more coherent descriptions and more fine-grained descriptions of small objects in the video, and that MSLR exceeds most state-of-the-art methods on common metrics across two widely used video description datasets.

(2) A semantic information embedding method for the objects in a video is proposed. The method first trains an object detector on the large-scale image recognition dataset ImageNet and feeds the video to the trained detector to extract the video's object collection. Second, the Robustly Optimized BERT (RoBERTa) model, trained on a large corpus, converts the textual information of the video object set into semantic features. Finally, the object semantic features are embedded into the visual information vector space of the MSLR framework to produce a more comprehensive video representation. Experimental results show that this method further improves the description of video objects in natural language and raises the MSLR framework's scores on several metrics on the standard datasets.
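Read as an architecture, contribution (1) decomposes into a multi-scale visual encoder (MS), a long-range temporal module (LR), and a GRU-based language model. The PyTorch sketch below shows one plausible way to wire these pieces together; the module internals (multi-branch 1-D convolutions for MS, self-attention for LR), the hyper-parameters, and the way visual context conditions the decoder are illustrative assumptions, not the thesis's actual design.

```python
# Minimal sketch of an MSLR-style captioning pipeline (assumptions: PyTorch,
# pre-extracted per-frame CNN features; all internals are illustrative only).
import torch
import torch.nn as nn


class MultiScaleModule(nn.Module):
    """Assumed MS module: 1-D convolutions with several kernel sizes along the
    frame axis, so cues at different temporal scales (including small objects
    visible only briefly) are retained."""
    def __init__(self, feat_dim, out_dim, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(feat_dim, out_dim, k, padding=k // 2) for k in scales
        )
        self.fuse = nn.Linear(out_dim * len(scales), out_dim)

    def forward(self, frames):                      # frames: (B, T, feat_dim)
        x = frames.transpose(1, 2)                  # (B, feat_dim, T)
        multi = [branch(x).transpose(1, 2) for branch in self.branches]
        return self.fuse(torch.cat(multi, dim=-1))  # (B, T, out_dim)


class LongRangeModule(nn.Module):
    """Assumed LR module: self-attention over frames, keeping long-range
    temporal dependencies instead of averaging them away."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                           # x: (B, T, dim)
        ctx, _ = self.attn(x, x, x)
        return self.norm(x + ctx)


class MSLRCaptioner(nn.Module):
    """MS + LR visual encoding followed by a bidirectional-GRU language model,
    mirroring the framework described above (sizes are placeholders)."""
    def __init__(self, feat_dim=2048, dim=512, vocab_size=10000):
        super().__init__()
        self.ms = MultiScaleModule(feat_dim, dim)
        self.lr = LongRangeModule(dim)
        self.embed = nn.Embedding(vocab_size, dim)
        self.bigru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, vocab_size)

    def forward(self, frames, captions):
        visual = self.lr(self.ms(frames))            # (B, T, dim)
        context = visual.mean(dim=1, keepdim=True)   # pooled visual context
        tokens = self.embed(captions) + context      # condition words on video
        hidden, _ = self.bigru(tokens)
        return self.out(hidden)                      # (B, L, vocab_size) logits
```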
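The object-semantics step in contribution (2) can be sketched in a similar spirit. The snippet below covers only the text side of that pipeline: detected object names (which the thesis obtains from its own ImageNet-trained detector, not reproduced here) are encoded with a pretrained RoBERTa checkpoint and projected into the visual feature space of the captioning model. The checkpoint name, the pooling choice, and the projection layer are assumptions for illustration.

```python
# Minimal sketch of the object-semantic embedding step (assumptions: HuggingFace
# transformers with a pretrained roberta-base checkpoint; the object names would
# come from the thesis's object detector, which is not reproduced here).
import torch
import torch.nn as nn
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base").eval()


def object_semantic_features(object_names, visual_dim=512):
    """Turn detected object names into a pooled semantic vector and project it
    into the visual feature space of the captioning model."""
    text = ", ".join(object_names)                     # e.g. "person, dog, frisbee"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = roberta(**inputs).last_hidden_state   # (1, n_tokens, 768)
    semantic = hidden.mean(dim=1)                      # (1, 768) pooled feature
    project = nn.Linear(768, visual_dim)               # hypothetical projection layer
    return project(semantic)                           # (1, visual_dim)


# Usage: the projected object semantics would then be fused with the MSLR visual
# features, e.g. added to (or concatenated with) the pooled visual context; in
# the thesis the projection would be learned jointly with the rest of the model.
obj_vec = object_semantic_features(["person", "dog", "frisbee"])
print(obj_vec.shape)    # torch.Size([1, 512])
```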
Keywords/Search Tags:Video description, Multi-scale information, Long-range temporal dependencies, Semantic information embedding