
Video Captioning Research Method Based On Transformer Network And Bidirectional Decoding

Posted on: 2022-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2518306497452184
Subject: Intelligent information processing

Abstract/Summary:
In recent years, research in deep learning fields such as computer vision (CV) and natural language processing (NLP) has developed rapidly, and more and more researchers have turned their attention to multimodal tasks such as video captioning. Video captioning refers to the automatic generation, for a given video clip of roughly 10-25 seconds, of a coherent natural language description that conforms to human expression habits. As a fundamental research task, video captioning has a wide range of real-world applications, including video retrieval, video content review, and assistance for visually impaired people.

Most existing video captioning research adopts the encoder-decoder framework: a convolutional neural network (CNN) serves as the encoder to extract visual semantic information from the video, and a recurrent neural network (RNN) serves as the decoder to generate the corresponding description. Although these studies have made great progress, CNNs are not well suited to capturing video sequence representations. Recently, the Transformer network, which is itself built on an encoder-decoder structure and has strong sequence modeling ability, has achieved remarkable results in natural language processing. This thesis therefore proposes a video captioning method based on the Transformer network. Its encoder effectively fuses multiple modal features of the video and learns a sequence representation of the video frames, so that the decoder can better model the relationship between the video and its semantic description and generate more appropriate captions.

In addition, existing encoder-decoder methods do not make full use of contextual semantic information: they exploit only the left-to-right (forward) context of the description and ignore the right-to-left (backward) context. This thesis therefore proposes a video captioning method based on bidirectional decoding, which uses both the forward and backward context of the description. Specifically, the method captures the context of the reversed description by constructing an additional backward decoder that decodes from right to left; only after the backward decoder has finished does the forward decoder decode from left to right. The bidirectional decoder thus makes full use of the contextual semantics of the video and generates higher-quality description sentences that better match human expression habits.

Extensive experiments are conducted on two benchmark datasets, MSVD and MSR-VTT-10K. First, compared with CNNs, the Transformer model generates better video sequence representations, and multi-modal video features outperform single-modal features. In addition, taking the Transformer-based captioning method as the baseline model, the experiments demonstrate the effectiveness of bidirectional decoding.
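The multi-modal fusion and frame-sequence representation that the encoder performs can be sketched as follows. The feature dimensions (e.g., 2048-d appearance and 1024-d motion features per frame), the random projection weights, and the sinusoidal positional encoding are illustrative assumptions; the abstract does not specify the encoder's exact internals.

```python
import numpy as np

def fuse_multimodal_features(appearance, motion, d_model, rng):
    """Concatenate per-frame appearance and motion features, then project
    them into the shared encoder dimension (weights are a stand-in here)."""
    concat = np.concatenate([appearance, motion], axis=-1)  # (T, Da + Dm)
    W = rng.standard_normal((concat.shape[-1], d_model)) * 0.01
    return concat @ W  # (T, d_model)

def positional_encoding(T, d_model):
    """Standard sinusoidal positional encoding, so the Transformer encoder
    can represent the temporal order of the sampled frames."""
    pos = np.arange(T)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

rng = np.random.default_rng(0)
T, Da, Dm, d_model = 26, 2048, 1024, 512  # assumed sizes, e.g. CNN + motion nets
appearance = rng.standard_normal((T, Da))  # per-frame appearance features
motion = rng.standard_normal((T, Dm))      # per-frame motion features
encoder_input = fuse_multimodal_features(appearance, motion, d_model, rng)
encoder_input = encoder_input + positional_encoding(T, d_model)
print(encoder_input.shape)  # (26, 512)
```

The fused, position-aware sequence would then be fed through the Transformer encoder's self-attention layers; only the input preparation is sketched here.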
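The two-stage bidirectional decoding procedure described above can be sketched as follows: the backward decoder first generates the sentence right to left, and only then does the forward decoder generate left to right while also conditioning on the completed backward hypothesis. The toy step functions below stand in for the real decoder networks, whose architecture the abstract does not detail; the example caption and greedy loop are illustrative assumptions.

```python
def backward_decode(enc_states, step_fn, max_len, eos="<eos>"):
    """Right-to-left decoding: each predicted token is prepended, so it is
    conditioned on the tokens that will FOLLOW it in the final sentence."""
    tokens = []
    for _ in range(max_len):
        tok = step_fn(enc_states, tokens)  # predicts the token before `tokens`
        if tok == eos:
            break
        tokens.insert(0, tok)
    return tokens

def forward_decode(enc_states, bwd_hypothesis, step_fn, max_len, eos="<eos>"):
    """Left-to-right decoding that can additionally attend to the finished
    backward hypothesis, giving it right-side (future) context."""
    tokens = []
    for _ in range(max_len):
        tok = step_fn(enc_states, tokens, bwd_hypothesis)
        if tok == eos:
            break
        tokens.append(tok)
    return tokens

# Toy stand-ins for the trained decoders (hypothetical target sentence).
TARGET = ["a", "man", "is", "playing", "guitar"]

def toy_backward_step(enc, suffix):
    idx = len(TARGET) - len(suffix) - 1
    return TARGET[idx] if idx >= 0 else "<eos>"

def toy_forward_step(enc, prefix, bwd_hypothesis):
    if len(prefix) < len(bwd_hypothesis):
        return bwd_hypothesis[len(prefix)]  # uses the backward context
    return "<eos>"

bwd = backward_decode(None, toy_backward_step, max_len=10)
fwd = forward_decode(None, bwd, toy_forward_step, max_len=10)
print(" ".join(fwd))  # → a man is playing guitar
```

The key design point this mirrors is the ordering constraint: the forward pass does not start until the backward pass has produced a full sentence, so every forward step can see right-side context that a purely left-to-right decoder never has.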
Keywords/Search Tags: Video Captioning, Bidirectional Decoding, Encoder-Decoder