
Research On Semantic-Aware Based Video Captioning

Posted on: 2023-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Sun
Full Text: PDF
GTID: 2558307118496264
Subject: Computer Science and Technology
Abstract/Summary:
Video captioning aims to automatically describe the main content of a video in natural language. As a cross-modal task connecting computer vision and natural language processing, it plays a pivotal role in artificial intelligence. In recent years, with the development of deep learning in both fields, methods such as long short-term memory (LSTM) networks, Transformer networks, attention mechanisms, and reinforcement learning have been successfully applied to video captioning. The task nevertheless remains extremely challenging and leaves considerable room for improvement. This paper designs network structures that improve video captioning performance from two major directions, semantic loss and semantic association. Specifically:

1) In video captioning models, the ground-truth words are generally used as decoder input during training, whereas at test time the model uses the word it generated at the previous time step. This creates a discrepancy between the training and testing phases. Training the entire network solely on the model's own previously generated words, however, is difficult. Combining the two decoding methods, this paper proposes a dual-stream decoder, which adds a new branch to exploit the semantic information of previously generated words. In the testing phase, the outputs of the traditional decoder and the self-learning decoder are fused in a certain proportion to generate the final word.

2) Video captioning models use autoregressive generation to build the temporal dependency between words and visual features, but generating a sentence word by word from left to right cannot directly build bidirectional dependencies between words and visual features. Inspired by word-to-vector models, this paper proposes a local semantic extractor to exploit bidirectional local semantic information. For modular networks with a complex structure, a convergent semantic extractor is proposed to constrain features; it is applied to both the visual part and the language part to learn the local association between visual features and word features. For lightweight network structures, a divergent semantic extractor is proposed, which supervises the generation of the current word using information from adjacent words, so that the model perceives adjacent semantic information.

3) Although the local semantic extractor can fuse adjacent bidirectional semantic information, it does not consider the correlation between words across the whole sentence. This paper therefore designs a global semantic extractor that supervises the generation of the current word with global semantic information. It computes a correlation matrix from the cosine similarity between the semantic features of the words in the sentence, and uses this similarity matrix to obtain the global semantic information corresponding to the current word, so that the model is aware of global semantics when generating that word.

4) This paper further integrates the dual-stream decoder, which addresses semantic loss, with the global semantic extractor, which addresses semantic association, and uses reinforcement learning to optimize the entire network.

The proposed network structures are evaluated on the Microsoft Video Description (MSVD) dataset and the MSR-Video to Text (MSR-VTT) dataset, and their effectiveness is verified both quantitatively and qualitatively. Quantitatively, the standard evaluation metrics BLEU-4, METEOR, ROUGE_L, and CIDEr are used to measure the quality of the video captioning models, and the effectiveness and advancement of the proposed structures are verified through ablation experiments and comparisons with similar methods from recent years. Among the proposed models, the fusion network optimized by reinforcement learning improves CIDEr by 6.1% on the MSR-VTT dataset. Qualitatively, this paper selects the generated descriptions of several video cases, compares and analyzes the video captioning models, and demonstrates the superiority of the proposed network in specific cases.
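As an illustration of items 1) and 3), the following minimal sketch shows one way a cosine-similarity correlation matrix could be turned into a per-word global context, and how two decoder output distributions could be mixed at test time. It is not the thesis implementation. The function names, the softmax weighting over each similarity row, the weighted-average fusion, and the parameter `alpha` are all assumptions made for this example; the abstract only states that cosine similarity is used and that the two streams are fused "in a certain proportion". Plain Python lists stand in for the tensors a real model would use.

```python
import math


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between word feature vectors."""
    # A zero vector gets norm 1.0 to avoid division by zero.
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    return [
        [sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
         for fj, nj in zip(features, norms)]
        for fi, ni in zip(features, norms)
    ]


def global_semantic_context(features):
    """One global context vector per word.

    Assumption: each row of the similarity matrix is normalized with a
    softmax and used to take a weighted sum of all word features.
    """
    sim = cosine_similarity_matrix(features)
    dim = len(features[0])
    contexts = []
    for row in sim:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights = [e / total for e in exps]
        contexts.append([
            sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)
        ])
    return contexts


def fuse_streams(p_traditional, p_self_learning, alpha=0.5):
    """Mix two word distributions; alpha is a hypothetical mixing weight."""
    return [alpha * a + (1.0 - alpha) * b
            for a, b in zip(p_traditional, p_self_learning)]
```

Calling `global_semantic_context` on a list of word feature vectors returns one context vector per word, and `fuse_streams` returns a distribution that still sums to 1 whenever both inputs do and `alpha` lies in [0, 1].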
Keywords/Search Tags:Video Captioning, Dual-stream Network, Local Semantic Extractor, Global Semantic Extractor, Fusion Network