
Research On Short Video Captioning Based On Deep Learning

Posted on: 2022-03-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H H Xiao
Full Text: PDF
GTID: 1488306569470254
Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, video data, especially short video, has grown explosively. Understanding video content is the most critical step in intelligent video analysis, and the ability to describe video in natural language enables many important applications, such as content-based video retrieval, automatic video surveillance, and human-computer interaction. Video captioning has therefore attracted much attention from both the computer vision and natural language processing communities. This paper focuses on short video captioning, and its contributions are summarized as follows:

1) The key to video captioning is to describe the video content accurately. Since current methods that focus visual attention on non-visual words may cause unnecessary gradient updates or even mislead the model, this paper proposes an adaptive attention mechanism to alleviate this problem. Meanwhile, traditional training optimizes the model with a cross-entropy loss, which adjusts the model only at the word level and ignores sentence-level optimization. To this end, this paper constructs a reinforcement learning loss that directly optimizes the model at the sentence level, and trains the entire captioning system with a mixed loss built on the adaptive attention model.

2) Visual attributes are increasingly popular for enhancing the performance of video captioning. To exploit the visual attributes of a video effectively, this paper proposes an attribute selection mechanism that filters the detected visual attributes. Furthermore, given the large differences between video length and text length in movie datasets, this paper proposes an adaptive frame cycle filling method that supplies the network with as many feature inputs as possible.

3) Most existing methods predict one word at a time, feeding the last generated word back as input at the next step, while the other generated words are not fully exploited. In addition, previous methods do not consider how well each training sample has been learned during training, which results in much unnecessary training. To address these issues, this paper proposes a text-based dynamic attention model named TDAM, which imposes a dynamic attention mechanism on all the generated words in order to strengthen control over the whole sentence. TDAM is trained in two stages: "starting from scratch" and "checking for gaps". The former uses all samples to optimize the model, while the latter trains only on the samples that are poorly controlled.

4) Besides accuracy, text descriptions have two other desirable characteristics: diversity and fine-grained detail. Exploring diverse and fine-grained descriptions of video is an important step in the development of machines toward personification. This paper therefore proposes a multi-description method based on a fully convolutional network and a conditional generative adversarial network (CGAN), together with a diversity evaluation metric, DCE, for quantitatively analyzing the diverse captions. For fine-grained captioning, a novel hierarchical architecture combining long short-term memory (LSTM) and a convolutional architecture is proposed to generate fine-grained descriptions. The model is optimized from different directions with a dual-stage loss: it uses a convolutional neural network (CNN) to construct fragment-level features and captures the detailed behavior information of a video by combining them with an attention mechanism. A novel performance evaluation metric named LTMS is also proposed to assess the fine-grained captions. In addition, this paper extends the short video captioning method to the dense captioning of long videos and proposes a novel feature loss to enhance the discrimination of different events in a long video. Experiments show that the proposed method for short video captioning can improve the performance of the dense captioning system for long videos.
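The mixed-loss training in contribution 1) can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes a self-critical baseline (the reward of a greedy-decoded caption) for the sentence-level reinforcement learning term and a mixing weight `lam`, neither of which is specified in the abstract; all function names here are hypothetical.

```python
import numpy as np

def cross_entropy_loss(probs, target_ids):
    # Word-level term: mean negative log-likelihood of the ground-truth
    # words. probs has shape (T, vocab_size); target_ids has length T.
    return -np.mean([np.log(probs[t, w]) for t, w in enumerate(target_ids)])

def self_critical_loss(log_prob_sampled, reward_sampled, reward_greedy):
    # Sentence-level term: REINFORCE with a greedy-decoding baseline.
    # The advantage (sampled reward minus greedy reward) scales the
    # log-probability of the sampled caption.
    return -(reward_sampled - reward_greedy) * log_prob_sampled

def mixed_loss(probs, target_ids, log_prob_sampled,
               reward_sampled, reward_greedy, lam=0.7):
    # Weighted combination of the sentence-level RL loss and the
    # word-level cross-entropy loss (lam is an assumed hyperparameter).
    xe = cross_entropy_loss(probs, target_ids)
    rl = self_critical_loss(log_prob_sampled, reward_sampled, reward_greedy)
    return lam * rl + (1 - lam) * xe
```

In practice the reward would be a caption metric such as CIDEr computed on the sampled and greedy captions; here the rewards are passed in as plain numbers to keep the sketch self-contained.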
Keywords/Search Tags: Video captioning, Reinforcement learning, Attention mechanism, Fully convolutional network, Dense captioning