
Research On Video Captioning Based On Deep Learning

Posted on: 2020-12-30
Degree: Master
Type: Thesis
Country: China
Candidate: L Sun
Full Text: PDF
GTID: 2428330572974163
Subject: Control Science and Engineering
Abstract/Summary:
Artificial intelligence can be roughly divided into two research directions: perceptual intelligence and cognitive intelligence. Perceptual intelligence has made great progress, for example in image classification and machine translation, while cognitive intelligence, exemplified by visual description, has developed more slowly. Combining natural language with computer vision helps build a bridge between humans and machines and advances the study of cognitive intelligence. With the development of deep learning in recent years, establishing a connection between video and natural language can be regarded as the ultimate goal of video understanding.

Video captioning is a research hotspot at the intersection of computer vision and natural language processing, with broad application prospects in video retrieval, human-computer interaction, and assistance for the visually impaired. Unlike coarse-grained, label-oriented visual understanding tasks such as video classification and object detection, video captioning must describe a video with a fluent and accurate sentence, which requires not only recognizing the objects in the video but also understanding the relationships between them. Moreover, video content can be described in many styles, such as an abstract description of the scene, a description of the relationships between objects, or a description of object behavior and motion, which makes the task highly challenging.

Traditional video captioning algorithms are mainly based on language templates or retrieval. Template-based methods, constrained by fixed templates, can only generate sentences of a single, inflexible form. Retrieval-based methods depend heavily on the size of the retrieval database: when the database contains no video similar to the one to be described, the generated description deviates greatly from the actual video content. Both approaches require complex video preprocessing at the front end and insufficiently optimize the language model at the back end, resulting in poor sentence quality. With the advance of deep learning, sequence learning models based on the encoder-decoder framework have achieved breakthroughs in video captioning. This thesis studies several video captioning algorithms; the main work is summarized as follows:

1. A novel deep architecture for video captioning, named the Multimodal Semantic Attention Network, is proposed. The key to video captioning lies in the extraction of video features. Because different modalities in a video complement one another, encoding multimodal information helps mine richer semantics. Moreover, because typical video captioning algorithms consider only video features and ignore high-level semantic attributes, this thesis also discusses how to extract high-level semantic attributes and apply them to the captioning task in order to improve the quality of the generated sentences. In the encoding phase, multimodal semantic attributes are detected by formulating their prediction as a multi-label classification problem, and an auxiliary classification loss is added so that the model learns more effective visual features and high-level multimodal attribute distributions for sufficient video encoding. In the decoding phase, each weight matrix of the conventional LSTM is extended to an ensemble of attribute-dependent weight matrices, and an attention mechanism attends to different attributes at each step of the captioning process (a simplified sketch follows below). We evaluate the algorithm on two popular public benchmarks and achieve results competitive with the current state of the art across six evaluation metrics.
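To make the attribute-dependent decoder concrete, the following is a minimal PyTorch sketch of the general idea, not the thesis's exact formulation: as a simplifying assumption, only the input-to-hidden weights are made attribute-dependent here, and the names (AttributeLSTMCell, attr_probs) are illustrative. At each step, attention over the K detected attributes mixes K weight matrices into one effective matrix per sample.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttributeLSTMCell(nn.Module):
        """Hypothetical sketch: an LSTM cell whose input-to-hidden weights
        are an attention-weighted mixture of K attribute-dependent matrices."""

        def __init__(self, input_size, hidden_size, num_attrs):
            super().__init__()
            # One (4*hidden_size, input_size) weight matrix per semantic attribute.
            self.W_ih = nn.Parameter(
                torch.randn(num_attrs, 4 * hidden_size, input_size) * 0.01)
            self.W_hh = nn.Linear(hidden_size, 4 * hidden_size)
            self.attr_attn = nn.Linear(hidden_size, num_attrs)

        def forward(self, x, h, c, attr_probs):
            # Attend over attributes at this time step; bias the attention by
            # the probabilities produced by the multi-label attribute classifier.
            alpha = F.softmax(self.attr_attn(h) + (attr_probs + 1e-8).log(), dim=-1)
            # Mix the K attribute-dependent matrices into one matrix per sample.
            W = torch.einsum('bk,kgi->bgi', alpha, self.W_ih)       # (B, 4H, I)
            gates = torch.einsum('bgi,bi->bg', W, x) + self.W_hh(h)
            i, f, g, o = gates.chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, c

Mixing the weight matrices, rather than concatenating attribute features to the input, lets each detected attribute reshape the decoder's transition dynamics at every step.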
2. This thesis analyzes the insufficient optimization of language generation at the decoder. Most video captioning algorithms, including the new model proposed in this thesis, model language by maximum likelihood and train the network with a cross-entropy loss, which brings two obvious defects. First is exposure bias: during training, the decoder input at each step is the ground-truth word from the training set, whereas at test time it is the word predicted at the previous step by greedy or beam search, so a single inaccurate prediction can propagate and degrade the quality of all subsequently generated words. Second is the mismatch between the training objective and the evaluation criteria: training maximizes the posterior probability under a cross-entropy loss, while testing uses objective metrics such as BLEU, METEOR, and CIDEr, so the evaluation metrics for video captioning are not adequately optimized. To solve these two problems, a reinforcement learning algorithm based on self-critical sequence training is introduced to improve the proposed model (a sketch of the loss follows below). The model is further trained by directly optimizing the objective evaluation metrics, and experiments demonstrate the effectiveness of this method for video captioning.
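As a reference for the self-critical training described above, here is a minimal sketch of the standard SCST loss under stated assumptions: the caller has already decoded one sampled caption and one greedy baseline caption per video and scored both with a metric such as CIDEr; the function and argument names are illustrative, not the thesis's code.

    import torch

    def scst_loss(sample_log_probs, sample_reward, greedy_reward, mask):
        """Self-critical sequence training loss.

        sample_log_probs: (B, T) log-probabilities of the sampled caption tokens
        sample_reward:    (B,)   metric score (e.g. CIDEr) of each sampled caption
        greedy_reward:    (B,)   metric score of the greedy-decoded baseline
        mask:             (B, T) 1.0 for real tokens, 0.0 for padding
        """
        # Advantage: how much the sampled caption beats the greedy baseline.
        advantage = (sample_reward - greedy_reward).unsqueeze(1)    # (B, 1)
        # REINFORCE with the model's own greedy output as the baseline:
        # raise the probability of samples that score above the baseline.
        loss = -(advantage * sample_log_probs * mask).sum() / mask.sum()
        return loss

Because the baseline is the model's own greedy output, no learned value function is needed, and optimizing the metric directly targets both the exposure bias and the train/test mismatch described above.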
Keywords/Search Tags: Video captioning, Multimodal, Attention mechanism, Semantic attribute, Deep learning