
Video Captioning Research Method Based On Transformer Network And Bidirectional Decoding

Posted on: 2022-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2518306497452184
Subject: Intelligent information processing

Abstract/Summary:
In recent years, research in deep learning fields such as computer vision (CV) and natural language processing (NLP) has developed rapidly, and more and more researchers have turned their attention to multimodal tasks such as video captioning. Video captioning refers to the automatic generation, for a given video clip of roughly 10-25 seconds, of a coherent natural language description that conforms to human expression habits. As a fundamental research task, video captioning has a wide range of real-world applications, including video retrieval, video content review, and assistance for visually impaired people.

Most existing video captioning research adopts the encoder-decoder framework: a convolutional neural network (CNN) serves as the encoder to extract visual semantic information from the video, and a recurrent neural network (RNN) serves as the decoder to generate the corresponding description. Although these studies have made great progress, CNNs are not well suited to capturing video sequence representations. Recently, the Transformer network, which is itself built on an encoder-decoder structure and has strong sequence modeling ability, has achieved remarkable results in natural language processing. This thesis therefore proposes a video captioning method based on the Transformer network. Its encoder effectively fuses multiple modal features of the video and learns a sequence representation of the video frames, so that the decoder can better model the relationship between the video and its semantic description and generate more appropriate captions.

In addition, existing encoder-decoder methods do not make full use of contextual semantic information: they exploit only the left-to-right (forward) context of the description and ignore the right-to-left (backward) context. This thesis therefore proposes a video captioning method based on bidirectional decoding, which uses both the forward and backward context of the description. Specifically, the method captures the context of the reversed description by constructing an additional backward decoder that decodes from right to left; only after the backward decoder has finished does the forward decoder decode from left to right. The bidirectional decoder thus makes full use of the contextual semantics of the video and generates higher-quality description sentences that better match human expression habits.

Extensive experiments are conducted on two benchmark datasets, MSVD and MSR-VTT-10K. First, compared with CNNs, the Transformer model generates better video sequence representations, and multi-modal video features outperform single-modal features. In addition, taking the Transformer-based captioning method as the baseline model, the experiments demonstrate the effectiveness of bidirectional decoding.
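The multi-modal fusion and frame-sequence representation that the encoder performs can be sketched as follows. The feature dimensions (e.g., 2048-d appearance and 1024-d motion features per frame), the random projection weights, and the sinusoidal positional encoding are illustrative assumptions; the abstract does not specify the encoder's exact internals.

```python
import numpy as np

def fuse_multimodal_features(appearance, motion, d_model, rng):
    """Concatenate per-frame appearance and motion features, then project
    them into the shared encoder dimension (weights are a stand-in here)."""
    concat = np.concatenate([appearance, motion], axis=-1)  # (T, Da + Dm)
    W = rng.standard_normal((concat.shape[-1], d_model)) * 0.01
    return concat @ W  # (T, d_model)

def positional_encoding(T, d_model):
    """Standard sinusoidal positional encoding, so the Transformer encoder
    can represent the temporal order of the sampled frames."""
    pos = np.arange(T)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

rng = np.random.default_rng(0)
T, Da, Dm, d_model = 26, 2048, 1024, 512  # assumed sizes, e.g. CNN + motion nets
appearance = rng.standard_normal((T, Da))  # per-frame appearance features
motion = rng.standard_normal((T, Dm))      # per-frame motion features
encoder_input = fuse_multimodal_features(appearance, motion, d_model, rng)
encoder_input = encoder_input + positional_encoding(T, d_model)
print(encoder_input.shape)  # (26, 512)
```

The fused, position-aware sequence would then be fed through the Transformer encoder's self-attention layers; only the input preparation is sketched here.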
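The two-stage bidirectional decoding procedure described above can be sketched as follows: the backward decoder first generates the sentence right to left, and only then does the forward decoder generate left to right while also conditioning on the completed backward hypothesis. The toy step functions below stand in for the real decoder networks, whose architecture the abstract does not detail; the example caption and greedy loop are illustrative assumptions.

```python
def backward_decode(enc_states, step_fn, max_len, eos="<eos>"):
    """Right-to-left decoding: each predicted token is prepended, so it is
    conditioned on the tokens that will FOLLOW it in the final sentence."""
    tokens = []
    for _ in range(max_len):
        tok = step_fn(enc_states, tokens)  # predicts the token before `tokens`
        if tok == eos:
            break
        tokens.insert(0, tok)
    return tokens

def forward_decode(enc_states, bwd_hypothesis, step_fn, max_len, eos="<eos>"):
    """Left-to-right decoding that can additionally attend to the finished
    backward hypothesis, giving it right-side (future) context."""
    tokens = []
    for _ in range(max_len):
        tok = step_fn(enc_states, tokens, bwd_hypothesis)
        if tok == eos:
            break
        tokens.append(tok)
    return tokens

# Toy stand-ins for the trained decoders (hypothetical target sentence).
TARGET = ["a", "man", "is", "playing", "guitar"]

def toy_backward_step(enc, suffix):
    idx = len(TARGET) - len(suffix) - 1
    return TARGET[idx] if idx >= 0 else "<eos>"

def toy_forward_step(enc, prefix, bwd_hypothesis):
    if len(prefix) < len(bwd_hypothesis):
        return bwd_hypothesis[len(prefix)]  # uses the backward context
    return "<eos>"

bwd = backward_decode(None, toy_backward_step, max_len=10)
fwd = forward_decode(None, bwd, toy_forward_step, max_len=10)
print(" ".join(fwd))  # → a man is playing guitar
```

The key design point this mirrors is the ordering constraint: the forward pass does not start until the backward pass has produced a full sentence, so every forward step can see right-side context that a purely left-to-right decoder never has.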
Keywords/Search Tags: Video Captioning, Bidirectional Decoding, Encoder-Decoder