
Research and Implementation of Long Video Captioning Technology Based on Deep Learning

Posted on: 2022-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: T Wen
Full Text: PDF
GTID: 2518306740483284
Subject: Software engineering
Abstract/Summary:
With the recent development of computer and network infrastructure, multimedia information such as images and videos has spread widely on the Internet. People's demand for video understanding continues to grow, and short video captioning based on deep learning methods has been studied extensively. However, the duration of a single video in traditional short video captioning tasks is usually between 5 and 25 seconds, while the vast number of videos in real-world scenes usually last more than 30 seconds, or even more than 1 minute. Correspondingly, long video captioning must accurately understand and analyze the events that occur in a long video and the dependencies between objects and events. Long video captioning therefore remains a challenging task.

There are two main challenges in long video captioning: (1) Long videos contain richer semantic information than short videos, and the visual elements, objects, and scenes that appear in each long video carry more uncertainty. (2) Existing methods struggle to ensure that the semantics generated by the model are correct and that the captions conform to human expression habits, so readability is poor. In response to these challenges, this thesis carries out research on, and an implementation of, long video captioning methods based on multi-modal deep learning. The main contents of the thesis are as follows:

(1) This thesis studies long video captioning technology. First, given that no existing dataset meets the needs of the long video captioning task, this thesis designs and implements a pipeline for constructing long video captioning datasets according to the needs of real scenes. This set of processing methods can efficiently automate dataset construction, and with it the Chinese and English long video captioning dataset Focus is built. The dataset contains 10,920 long video clips, each accompanied by video, audio, and text files.

(2) Based on the Focus dataset, this thesis proposes a long video captioning model named Bert-based Long Video to Text (BLVT). The model treats the rich text information contained in a long video as the primary signal and the visual information as an auxiliary signal, and extracts features from the clauses of the long text. A reconstruction layer is then proposed to further process the extracted features and obtain document-level features that incorporate context information. The experimental results of BLVT on the Focus dataset show that the text information in long videos plays an important role in long video captioning, and that text features passed through the feature reconstruction layer achieve better captioning results.

(3) Building on the long video captioning model above, this thesis further studies how visual information can be obtained. Representative visual labels that describe the overall or local characteristics of a long video are extracted from three sources: video category labels, object detection labels, and key person detection labels. Two methods are then used to fuse these visual labels with the text content. Finally, the model is applied to the Focus dataset constructed in this thesis, and the experimental results show that fusing visual and text information effectively improves long video captioning and achieves good results in both the Chinese and English captioning tasks.

In summary, this thesis studies long video captioning methods based on deep learning and successfully applies them to the constructed dataset, providing a reference for further research on long video captioning.
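The clause-level encoding and reconstruction-layer idea in item (2) can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch/Hugging Face example, assuming a pretrained BERT encoder produces one [CLS] vector per transcript clause and a small Transformer encoder (standing in for the reconstruction layer) re-encodes those vectors into context-aware, document-level features; the names ReconstructionLayer and encode_clauses, and the choice of bert-base-chinese, are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class ReconstructionLayer(nn.Module):
    """Re-encodes per-clause embeddings so each clause sees document-level context."""

    def __init__(self, dim=768, heads=8, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, clause_feats):          # (batch, n_clauses, dim)
        return self.encoder(clause_feats)     # context-aware document-level features


tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")


def encode_clauses(clauses):
    """Encode each transcript clause independently with BERT and keep its [CLS] vector."""
    batch = tokenizer(clauses, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls_vectors = bert(**batch).last_hidden_state[:, 0]   # (n_clauses, 768)
    return cls_vectors.unsqueeze(0)                           # (1, n_clauses, 768)


clause_feats = encode_clauses(["clause one ...", "clause two ..."])
doc_feats = ReconstructionLayer()(clause_feats)   # passed to a caption decoder downstream
```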
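Item (3) describes fusing visual labels with the text content. One simple way to picture such a fusion, offered purely as an illustration and not necessarily either of the two fusion methods used in the thesis, is to serialize the labels as tagged text and prepend them to each clause before text encoding; the tag format and function name below are assumptions.

```python
def fuse_labels_with_text(category, objects, key_persons, transcript_clauses):
    """Prepend visual labels to each clause so the text encoder can attend to them."""
    label_str = (
        f"[CATEGORY] {category} "
        f"[OBJECTS] {', '.join(objects)} "
        f"[PERSONS] {', '.join(key_persons)}"
    )
    return [f"{label_str} [SEP] {clause}" for clause in transcript_clauses]


fused_inputs = fuse_labels_with_text(
    category="news report",
    objects=["podium", "microphone"],
    key_persons=["news anchor"],
    transcript_clauses=["clause one ...", "clause two ..."],
)
```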
Keywords/Search Tags: Computer vision, Long video captioning, Convolutional neural networks, Text summarization, Bidirectional Encoder Representations from Transformers