
Deep Learning For Video Description

Posted on: 2020-11-18  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y Wang  Full Text: PDF
GTID: 1368330575957141  Subject: Computer Science and Technology
Abstract/Summary:
Video description is the task of generating descriptive natural language for videos. It has great application value not only in social networks but also in surveillance and human-computer interaction systems. Video description requires both processing the video to extract visual information, such as objects, actions, and the relationships between objects, and generating descriptions that follow correct grammar rules. The task also has great research significance: it can promote the development of multimodal fusion and interaction, and it can inspire other multimodal tasks to learn better shared information.

The video description task is usually addressed with a Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) encoder-decoder model that describes video clips. Although this model has achieved some success, several problems remain to be solved: the model learns insufficient linguistic information, builds little correlation between vision and text, and lacks textual supervision when extracting visual attention. To solve these problems, this dissertation proposes three models. The main contributions are as follows.

1) This dissertation proposes a Video Description model with Subject-Verb-Object Supervision (VD-SVOs). Built on the CNN-LSTM structure, the VD-SVOs model attaches a Subject-Verb-Object (SVO) classifier to the LSTM. The SVO triple is a skeleton structure that carries the main semantics and the basic syntax of a sentence, and it plays an important role in improving the quality of the generated descriptions (a minimal sketch of this architecture is given after contribution 2 below). The VD-SVOs model is evaluated on the publicly available Youtube2Text dataset; a BLEU-4 score of 28.29% indicates that the generated sentences follow correct grammar rules and that the model outperforms the baseline models.

2) This dissertation also proposes a Video Description model with Integrated information of Vision and Text (VD-ivt), which adds channels to promote the integration of the visual and textual modalities. The VD-ivt model consists of three parallel channels: a basic CNN-LSTM structure for sentence generation, a sentence-to-sentence channel for learning linguistic information, and a fusion channel that receives the visual and textual information in turn and produces an integrated representation that strengthens the relationship between the two modalities. METEOR scores of 29.84% on Youtube2Text and 7.5% on LSMDC show that the VD-ivt model outperforms other baseline models and demonstrate the effectiveness of the fusion channel. Visualization analysis further illustrates that the fusion channel helps the model learn information shared between the two modalities.
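The following is a minimal PyTorch sketch of the kind of CNN-LSTM captioner with an auxiliary SVO classifier described in contribution 1. It is an illustration under assumptions, not the dissertation's actual code: the class name, dimensions, single-layer design, and the placement of the SVO heads on the clip-level summary are all hypothetical, and the CNN is assumed to have already produced per-frame features.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a CNN-LSTM video captioner with an auxiliary
# subject-verb-object (SVO) head, in the spirit of VD-SVOs.
class SVOCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab=10000, svo_vocab=3000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)  # encodes frame features
        self.decoder = nn.LSTMCell(2 * hidden, hidden)              # generates words
        self.embed = nn.Embedding(vocab, hidden)
        self.word_head = nn.Linear(hidden, vocab)                   # next-word logits
        # three auxiliary heads supervising the subject, verb, and object slots
        self.svo_heads = nn.ModuleList(nn.Linear(hidden, svo_vocab) for _ in range(3))

    def forward(self, feats, captions):
        # feats: (B, T_frames, feat_dim) pre-extracted CNN features
        # captions: (B, T_words) ground-truth word indices (teacher forcing)
        _, (h, c) = self.encoder(feats)
        h, c = h[0], c[0]
        video_ctx = h                                # clip-level visual summary
        word_logits = []
        for t in range(captions.size(1) - 1):
            x = torch.cat([self.embed(captions[:, t]), video_ctx], dim=1)
            h, c = self.decoder(x, (h, c))
            word_logits.append(self.word_head(h))
        svo_logits = [head(video_ctx) for head in self.svo_heads]   # S/V/O predictions
        return torch.stack(word_logits, dim=1), svo_logits
```

Training such a model would combine the usual word-level cross-entropy with cross-entropy on the three SVO slots, so that the skeleton supervision shapes the representation used by the decoder.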
3) This dissertation proposes an Image Caption model with Synchronous Cross-Attention (IC-SCA). The model has two stages, visual and textual, which jointly model the multimodal information to generate descriptions. The attention mechanism predicts the current attention from the previously attended visual content; to enhance the influence of the words, that visual content is chosen by the previous word, since the current word is not yet known. The attention hence incorporates both the order and the content of the words, and this dissertation uses it to guide the language model during generation (a hedged sketch of this word-guided attention step is given after contribution 4 below). The IC-SCA model is evaluated on MS-COCO, one of the largest image-caption datasets; a CIDEr score of 100% demonstrates that the IC-SCA model outperforms the benchmarks, and attention visualization further verifies the effectiveness of the proposed mechanism.

4) A "Blind Eye" system for smartphones and web pages is designed and developed on top of the deep video description models. The web version generates descriptions for uploaded videos; the smartphone application records a video, generates the description, and speaks it aloud. The "Blind Eye" system aims to provide visually impaired people with information about the current scene, bringing great convenience to their lives.
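As a companion to contribution 3, below is a hedged PyTorch sketch of one word-guided cross-attention step: the embedding of the previously generated word scores the image-region features and selects the visual content for the current step. The class name, shapes, and the additive scoring form are assumptions for illustration; the dissertation's actual formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of a single word-guided attention step,
# in the spirit of IC-SCA's synchronous cross-attention.
class WordGuidedAttention(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, attn_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)   # projects region features
        self.proj_w = nn.Linear(embed_dim, attn_dim)  # projects previous-word embedding
        self.score = nn.Linear(attn_dim, 1)           # additive attention score

    def forward(self, regions, prev_word_emb):
        # regions: (B, R, feat_dim) image-region features
        # prev_word_emb: (B, embed_dim) embedding of the word generated at step t-1
        q = self.proj_w(prev_word_emb).unsqueeze(1)           # (B, 1, attn_dim)
        e = self.score(torch.tanh(self.proj_v(regions) + q))  # (B, R, 1)
        alpha = F.softmax(e, dim=1)                           # attention weights over regions
        context = (alpha * regions).sum(dim=1)                # (B, feat_dim) attended content
        return context, alpha.squeeze(-1)
```

At each decoding step, the returned context vector would be fed to the language model together with the word embedding, so the visual evidence presented to the decoder follows the word order.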
Keywords/Search Tags: Video Description, Image Caption, Deep Learning, Convolutional Neural Network, Long Short-Term Memory