
Research on Salient Region Generation Based on Captions

Posted on: 2020-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: T F Xing
Full Text: PDF
GTID: 2428330578479632
Subject: Computer Science and Technology

Abstract/Summary:
Image captioning and video captioning are research directions that have emerged in recent years. They span computer vision and natural language processing and have attracted wide attention from researchers. Despite their excellent performance, captioning models provide little insight into the mapping they learn internally between an image and its caption. This thesis uses caption guidance to achieve top-down saliency with deep models, revealing the inner workings of image and video captioning models. It studies the spatial features, temporal features, and mapping methods of such models. The main research contributions are as follows:

(1) Addressing the problem that traditional caption-guided saliency methods make insufficient use of the spatial features of the image, this thesis proposes a gradient-weighted target activation mapping method for generating salient regions. The method uses global average pooling to provide global-level spatial information about the image, and multiplies the word-guided weights with the feature maps of the last convolutional layer of the convolutional neural network to produce word-guided weighted convolutional features, thereby obtaining a feature map that carries both global and local spatial information; finally, a salient region with spatial information is generated through the ReLU activation function. Experiments on the Flickr30k dataset demonstrate that the proposed method effectively improves saliency detection performance.
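The following is a minimal NumPy sketch of the gradient-weighted activation-mapping step just described. The function name, array shapes, and the assumption that per-word gradients are already available from the captioning model's backward pass are illustrative, not the thesis's actual implementation.

```python
import numpy as np

def word_guided_cam(feature_maps, word_grads):
    """Grad-CAM-style saliency map for a single caption word (a sketch).

    feature_maps: (K, H, W) activations of the last conv layer.
    word_grads:   (K, H, W) gradients of the word's score with respect
                  to those activations (hypothetical backward-pass output).
    """
    # Global average pooling of the gradients: one weight per channel.
    weights = word_grads.mean(axis=(1, 2))             # (K,)
    # Weighted sum of the feature maps across channels.
    cam = np.tensordot(weights, feature_maps, axes=1)  # (H, W)
    # ReLU keeps only regions that positively support the word.
    cam = np.maximum(cam, 0.0)
    # Normalize to [0, 1] so the map can be overlaid on the image.
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

The final ReLU is what turns the weighted activation map into a saliency map: only the regions whose features increase the guiding word's score are retained.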
(2) Addressing the problem that traditional caption-guided saliency methods mostly rely on single-frame information and use every word of the caption to guide saliency generation, so that temporal information is lost and meaningless words reduce the accuracy of the generated salient regions, a notional-word-guided method for salient-region generation with spatio-temporal feature fusion is proposed. The method uses the Natural Language Toolkit (NLTK) to extract the notional words (content words) from the caption, reducing the interference of meaningless words; it fuses the static features extracted by a convolutional neural network with the spatio-temporal features extracted by 3D Convolutional Neural Networks (C3D) through a bilinear pooling operation to form enhanced spatio-temporal features. The enhanced spatio-temporal features are passed through a multi-layer Long Short-Term Memory (LSTM) architecture to obtain a probability distribution p; the key frames of each video segment are fed into the same multi-layer LSTM architecture to obtain a probability distribution q; finally, the two distributions are compared with the KL divergence to generate the salient regions.

(3) Addressing the problem that traditional caption-guided saliency methods use a single encoding scheme and that the traditional LSTM network cannot effectively capture the most informative features, this thesis proposes a salient-region generation method with an enhanced discriminative network under hierarchical semantic guidance of the caption. The method uses word embedding, one-dimensional convolution, and bi-directional Gated Recurrent Unit (Bi-GRU) encoding to obtain word-, phrase-, and sentence-level features, achieving multiple mappings from the caption to images or video frames. Replacing the original LSTM's tanh activation function with the ReLU activation function yields a non-negative neuron signal in each LSTM; inverting the weights at the same time yields a non-positive neuron signal, so the network obtains two probability distributions. Finally, the KL divergence is used to compute the information gain of a single video frame against the video-frame sequence, localizing the salient regions of the image or video frame, and the two saliency maps are combined by subtracting their intersection to obtain a more discriminative saliency map. Experiments show that the hierarchical semantics effectively realize diverse connections between text and image, and that the strongly discriminative network can distinguish more complex scenes; the method markedly improves the performance of the model.
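Methods (2) and (3) share two building blocks: filtering the caption down to its notional words and scoring frames by the KL divergence between two distributions. The sketch below illustrates both under stated assumptions; the POS-tag set, function names, and the smoothing constant are choices made here for illustration, and in the thesis the distributions p and q would come from the LSTM architectures described above.

```python
import numpy as np
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data

# POS-tag prefixes treated as notional (content) words -- an assumption;
# the thesis does not list its exact tag set.
NOTIONAL_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")

def notional_words(caption):
    """Filter a caption down to its notional words via NLTK POS tagging."""
    tokens = nltk.word_tokenize(caption)
    return [word for word, tag in nltk.pos_tag(tokens)
            if tag.startswith(NOTIONAL_TAG_PREFIXES)]

def kl_information_gain(p, q, eps=1e-8):
    """KL divergence D(p || q) between two probability distributions.

    In the thesis, p and q come from the full segment vs. its key
    frames (method 2) or from the non-negative vs. non-positive LSTM
    signals (method 3); here both are plain arrays.
    """
    p = np.asarray(p, dtype=float) + eps  # eps avoids log(0)
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Example: determiners like 'a' and 'the' are dropped as meaningless words.
print(notional_words("a brown dog chasing the ball"))
# -> ['brown', 'dog', 'chasing', 'ball']
```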
Keywords/Search Tags: image caption, video caption, saliency, nonlinear activation function, convolutional neural networks