
Research On Video Caption Based On Deep Learning Sequence Model

Posted on: 2021-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Hao
GTID: 2428330632962846
Subject: Computer Science and Technology
Abstract/Summary:
Video captioning is a research hotspot in computer vision with a wide range of applications, including video retrieval and video understanding. The task is to describe a video clip with a short English sentence. Research in this area still has many shortcomings, and the goal of this thesis is to improve the accuracy of the sentences generated by video captioning algorithms from the following angles.

A video is a complex carrier of information in different modalities, chiefly visual and audio. Different modalities capture features along different dimensions and so complement one another: visual information covers most of the content of a video, while audio information can effectively reinforce it. Current research still concentrates on the visual modality, yet intuitively, fusing audio with visual information should improve the accuracy of caption generation. How to encode the visual information in a video efficiently, and how to fuse information from different modalities effectively, are therefore among the main challenges in video caption generation.

In recent years, deep learning built around neural networks has been applied successfully in countless fields. Deep neural networks, with their capacity for autonomous representation learning, have become the first choice for video captioning, and recent research practice confirms their superiority. This thesis improves both the overall framework and the local structure of the captioning algorithm to address the blurred hierarchy of visual information and the modality-fusion problem in video captioning. The details are as follows:

1. Current work that builds video captioning on visual information alone ignores the hierarchical character of video structure. Starting from improving the recognition of scene switches in the visual stream, this thesis proposes a caption generation network built on a scene-edge-detection encoder. During encoding, the network adaptively learns whether a scene boundary has been reached, which allows the visual encoding to carry more hierarchical structure information (a rough sketch of this gating idea follows this list). The effectiveness of the model is verified both qualitatively and quantitatively, with quite competitive results on two public datasets.

2. Current work on video captioning also ignores the effect of audio timing information on the attention mechanism during text generation. Taking the contribution of audio information to textual attention as the entry point, this thesis proposes a captioning network with an audio-visual multi-modal attention mechanism: audio and visual information jointly participate in deciding each generated word, so that audio can supplement vision (see the second sketch below). The validity of the model is analyzed qualitatively and quantitatively, with good results on public datasets.

3. Finally, this thesis proposes a captioning model based on multi-layer audio-visual cross-modal attention, which represents the video with feature encodings at different levels and granularities. Multi-layer encoders on both the audio and visual streams yield feature vectors of the video at different modalities and different levels; the model's multiple attention mechanisms then fuse a total of four kinds of features, from two modalities at two levels, improving the expressive power of the video encoding (see the third sketch below). Experiments with this model achieve improvements on several evaluation metrics on the public datasets.
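As a rough illustration of the first contribution, the sketch below implements a frame-level GRU encoder whose hidden state is softly reset whenever a learned scene-boundary gate fires, so the encoding carries segment-level structure on top of the frame sequence. This is a minimal PyTorch sketch under our own assumptions (feature dimensions, the sigmoid gate, the soft reset), not the thesis's exact formulation.

```python
import torch
import torch.nn as nn

class SceneEdgeEncoder(nn.Module):
    """Frame-level GRU whose hidden state is softly reset whenever a
    learned scene-boundary gate fires (illustrative sketch, not the
    thesis's exact scene-edge-detection encoder)."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden_dim)
        self.boundary = nn.Linear(hidden_dim, 1)  # scene-switch detector
        self.hidden_dim = hidden_dim

    def forward(self, frames):                    # frames: (T, B, feat_dim)
        T, B, _ = frames.shape
        h = frames.new_zeros(B, self.hidden_dim)
        outputs, gates = [], []
        for t in range(T):
            h = self.cell(frames[t], h)
            g = torch.sigmoid(self.boundary(h))   # P(scene edge at step t)
            outputs.append(h)
            gates.append(g)
            h = h * (1.0 - g)                     # soft reset at detected edges
        return torch.stack(outputs), torch.stack(gates)

# e.g. feats, gates = SceneEdgeEncoder()(torch.randn(30, 2, 2048))
```

The soft multiplicative reset keeps the whole encoder differentiable; a hard threshold on the gate would make the boundary decision discrete but would require a straight-through or reinforcement-style estimator to train.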
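For the second contribution, the following sketch shows one decoding step in which separate additive attention heads over the visual and the audio feature sequences produce two context vectors that jointly drive the word LSTM. The additive attention form and all dimensions are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Standard additive (Bahdanau-style) attention over a feature sequence."""
    def __init__(self, feat_dim, query_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_query = nn.Linear(query_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, query):              # feats: (B, T, D), query: (B, Q)
        scores = self.v(torch.tanh(self.w_feat(feats)
                                   + self.w_query(query).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)          # attention weights over T
        return (alpha * feats).sum(dim=1)         # context vector: (B, D)

class AVAttentionDecoder(nn.Module):
    """One decoding step: attend over the visual and audio streams
    separately, then let the fused context drive the word LSTM
    (illustrative sketch of the multi-modal attention idea)."""
    def __init__(self, vis_dim=512, aud_dim=128, embed_dim=300,
                 hidden_dim=512, vocab=10000):
        super().__init__()
        self.vis_attn = AdditiveAttention(vis_dim, hidden_dim)
        self.aud_attn = AdditiveAttention(aud_dim, hidden_dim)
        self.embed = nn.Embedding(vocab, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + vis_dim + aud_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab)

    def forward(self, word, state, vis_feats, aud_feats):
        h, c = state
        ctx = torch.cat([self.vis_attn(vis_feats, h),
                         self.aud_attn(aud_feats, h)], dim=-1)
        h, c = self.lstm(torch.cat([self.embed(word), ctx], dim=-1), (h, c))
        return self.out(h), (h, c)                # word logits, new state
```

Because each modality has its own attention head, the audio stream can shift the decoder's focus at time steps where the visual evidence is ambiguous, which is the intuition behind letting audio supplement vision.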
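For the third contribution, the sketch below stacks two encoder layers per modality, attends over the four resulting streams (two modalities at two levels), and mixes the four context vectors with a learned gate. The layer types, the dot-product attention, and the gating are all assumptions meant only to make the four-way fusion structure concrete.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(feats, query, proj):                   # feats: (B, T, D), query: (B, A)
    """Dot-product attention after projecting feats into the query space."""
    keys = proj(feats)                            # (B, T, A)
    alpha = F.softmax(keys @ query.unsqueeze(-1), dim=1)  # (B, T, 1)
    return (alpha * feats).sum(dim=1)             # context: (B, D)

class MultiLayerCrossModalFusion(nn.Module):
    """Two stacked GRU encoders per modality yield low- and high-level
    feature sequences; four attention heads produce one context each,
    and a learned gate mixes the four (illustrative sketch)."""
    def __init__(self, vis_dim=2048, aud_dim=128, hid=512, q_dim=512):
        super().__init__()
        # keep each layer's outputs as a separate feature level
        self.vis_l1 = nn.GRU(vis_dim, hid, batch_first=True)
        self.vis_l2 = nn.GRU(hid, hid, batch_first=True)
        self.aud_l1 = nn.GRU(aud_dim, hid, batch_first=True)
        self.aud_l2 = nn.GRU(hid, hid, batch_first=True)
        self.projs = nn.ModuleList(nn.Linear(hid, q_dim) for _ in range(4))
        self.gate = nn.Linear(q_dim + 4 * hid, 4) # mixing weights over streams

    def forward(self, vis, aud, query):           # vis: (B,Tv,vis_dim), aud: (B,Ta,aud_dim)
        v1, _ = self.vis_l1(vis); v2, _ = self.vis_l2(v1)
        a1, _ = self.aud_l1(aud); a2, _ = self.aud_l2(a1)
        streams = [v1, v2, a1, a2]                # 2 modalities x 2 levels
        ctxs = [attend(s, query, p) for s, p in zip(streams, self.projs)]
        w = F.softmax(self.gate(torch.cat([query] + ctxs, dim=-1)), dim=-1)
        return sum(w[:, i:i + 1] * c for i, c in enumerate(ctxs))
```

The gate lets the decoder weight low-level and high-level evidence from each modality differently per step, which is one plausible reading of fusing "four kinds of features from two modalities at different levels."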
Keywords/Search Tags: video caption, encoder-decoder, multi-modal attention, multi-layer encoder