
Visual Content Description Technology Research Based On Deep Learning

Posted on: 2019-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: L K Li
Full Text: PDF
GTID: 2348330569487670
Subject: Communication and Information System
Abstract/Summary:
Computer vision technology based on deep learning has recently attracted considerable interest in artificial-intelligence research, spanning fields such as object recognition, object detection, behavior recognition, and visual content description. Visual content description is an important fundamental problem that reflects the development of computer vision: it shows how artificial intelligence can effectively build a bridge between vision and language. To further study deep-learning-based visual content description, this thesis focuses on video description. The main contributions are summarized as follows.

Starting from the question of how to effectively relate visual features to semantic features, we propose a video description model with a multi-attention mechanism over spatial, temporal, and channel features. Video description aims to generate a textual description that expresses the content of a video; it involves both computer vision and natural language processing, and how to effectively bridge the two domains is a hot spot of current research. Our analysis shows that as the model generates a sentence word by word, the video features it needs change: video features are diverse, comprising temporal, spatial, behavioral, and other features, so each generated word requires specific features. Based on this observation, we compute attention weights over the video's temporal, spatial, and channel features and apply them in an LSTM (Long Short-Term Memory) decoding module, effectively modeling the mapping between visual features and semantic features. The proposed multi-attention model was tested and compared on the MSVD and MSR-VTT datasets, and the results show that it performs better on several metrics. We also propose a saliency analysis method based on video description: building on the trained model, we construct a top-down saliency model and
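The multi-attention weighting described above can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the feature tensor layout, the projection matrices `W_t`, `W_s`, `W_c`, and the pooling via `einsum` are all illustrative assumptions; a real model would learn these parameters jointly with the LSTM decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_attention_pool(feats, h, W_t, W_s, W_c):
    """Pool a video feature tensor with temporal, spatial, and channel
    attention conditioned on the decoder's hidden state h.

    feats:  (T, R, C) array -- T frames, R spatial regions, C channels.
    h:      (H,) hidden state of the LSTM decoder at the current step.
    W_t/W_s/W_c: (hypothetical) projections from h to scores over each axis.
    Returns a (C,) context vector to feed the LSTM at the next step.
    """
    a_t = softmax(W_t @ h)  # (T,) temporal attention weights, sum to 1
    a_s = softmax(W_s @ h)  # (R,) spatial attention weights
    a_c = softmax(W_c @ h)  # (C,) channel attention weights
    # Weight frames and regions, sum them out, then reweight the channels.
    ctx = np.einsum('trc,t,r->c', feats, a_t, a_s) * a_c
    return ctx
```

Because the weights are recomputed from `h` at every decoding step, the pooled context can emphasize different frames, regions, or channels for each generated word, which is the intuition behind the "each word requires specific features" argument above.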
demonstrate its effectiveness through visualization.

By studying the relationship between the naturalness, fidelity, and diversity of the sentences a video description model produces and the model's training mechanism, we propose a video description model based on GANs (Generative Adversarial Networks) and RL (Reinforcement Learning). Current video description frameworks are trained by maximum likelihood estimation (MLE). This objective only evaluates the accuracy of the word generated at each time step; it does not treat the output as a whole sentence, which should be evaluated along multiple dimensions. A good caption should have three properties: fidelity, naturalness, and diversity. Targeting these properties, we apply a GAN to the video description model. The generator (G network) is given a diversity random variable so that the model can produce varied descriptions, and for the discriminator (D network) we propose a joint objective function that evaluates the generator's sentences along all three dimensions. Second, we train the GAN with reinforcement learning to overcome the problem that a GAN cannot be trained directly on discrete data. Experiments and analysis show that the GAN-based video description model in this thesis generates captions that are more natural, faithful, and diverse.
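The reinforcement-learning workaround for discrete outputs can be sketched as a REINFORCE (policy-gradient) update, where the discriminator's score on a sampled word stands in for the reward. This is a minimal single-word sketch under assumed names (`theta`, `state_feat`, `discriminator`); the thesis's actual model scores whole sentences with the D network and uses a learned generator, not a linear policy.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_step(theta, state_feat, vocab_size, discriminator,
                   lr=0.1, rng=None):
    """One policy-gradient step for one generated word.

    theta:         (vocab_size, d) policy parameters; logits = theta @ state_feat.
    state_feat:    (d,) decoder state features (illustrative stand-in).
    discriminator: maps a sampled word id to a scalar reward in [0, 1],
                   standing in for the D network's sentence score.
    Updates theta in place to raise the log-probability of rewarded words,
    which is how gradients flow despite the discrete sampling step.
    """
    rng = rng or np.random.default_rng()
    probs = softmax(theta @ state_feat)   # policy over the vocabulary
    w = int(rng.choice(vocab_size, p=probs))  # sample a discrete word
    reward = discriminator(w)             # D's score acts as the reward
    # Gradient of log pi(w) w.r.t. the logits: one_hot(w) - probs.
    grad_logits = -probs
    grad_logits[w] += 1.0
    theta += lr * reward * np.outer(grad_logits, state_feat)
    return w, reward
```

Repeating this update concentrates probability mass on words the discriminator rewards, illustrating why RL sidesteps the non-differentiability of sampling that blocks ordinary GAN backpropagation on text.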
Keywords/Search Tags: Deep Learning, Video Description, Multi-Attention Mechanisms, Saliency, GAN Network