
Research On Image-text Matching Based On Deep Learning

Posted on: 2020-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: J L Zhang
GTID: 2428330590973217
Subject: Computer technology
Abstract/Summary:
With the development of human-computer interaction, information often exists in multiple modalities such as language, sound, and images. How to teach computers to manage and understand cross-modal information effectively has become a hot topic in artificial intelligence. This thesis focuses on the image-text matching problem: given a text description, retrieve the image regions or images that match it. We divide the thesis into two parts according to open and closed vocabulary: the third and fourth chapters address open-vocabulary tasks, and the fifth chapter addresses closed-vocabulary tasks. Building on existing work, the third chapter proposes a hierarchical reward function to deal with the incomplete annotation and class imbalance in the relevant datasets. To address the shortcomings of random sampling and pre-trained models in the third chapter, the fourth chapter further proposes an online hard negative mining strategy and introduces, for the first time, a knowledge base module from the weakly supervised domain into the visual-language matching task of supervised learning.
In the method based on the hierarchical reward function, we first analyze the problems that arise when the cross-entropy loss commonly used in object detection is applied directly to the task of this thesis. To alleviate these problems, we propose a hierarchical reward function, which dynamically generates different learning rates for the target, context, and background objects. To make the training and inference objectives as consistent as possible, we use the hierarchical reward function to approximate the R@K metric as the objective function and use policy gradient to optimize this non-differentiable function. In addition, we introduce a triplet loss to further improve performance and use the confusion matrix to achieve an easy-to-hard learning process.
In the hard context object mining method, we score each image region with a classic visual-language matching network, take the negative samples with higher scores as hard samples, and send them to the third-stage network for training together with the positive samples. This network has the same structure as the scoring network, but its input is the hard visual samples and all phrases, including contextual phrases. In addition, so that the object categories recognized by the model are no longer limited to the categories in the pre-trained model of the visual feature extraction module, we introduce, for the first time, the knowledge base module into the visual-language matching task of supervised learning, using text similarity as a bridge to weight each image region and filter out the unclassified matching modules.
In the open-vocabulary method, the input text may use any word. In contrast, closed-vocabulary input text can only use words in a fixed vocabulary. To transform words that the model has not learned into computable vectors, we use the Internet to dynamically mine visual representations of these words. Representative features are extracted from the noisy web data using self-similarity and the correlation matrix proposed in this thesis. We validate the effectiveness of the proposed algorithm on the sentence-image matching task. To further demonstrate the practical value of the algorithm, we collect real travel data from the web, propose the TVN25 dataset, and carry out the "Travel Notes Visualisation" task on it. The algorithm in this chapter does not require manual labeling (weak supervision) and is highly scalable, which is conducive to the large-scale commercial application of visual-language matching tasks.
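The hard negative mining step and the triplet loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the margin value, and the toy scores are all assumptions made for illustration, and the scores stand in for whatever a visual-language matching network would output for each image region given one phrase.

```python
from typing import List, Sequence


def mine_hard_negatives(scores: Sequence[float], positive_idx: int, k: int) -> List[int]:
    """Return the indices of the k highest-scoring non-matching regions.

    Hypothetical helper: `scores[i]` is the matching score of region i
    for a single phrase, and `positive_idx` is the ground-truth region.
    """
    negatives = [i for i in range(len(scores)) if i != positive_idx]
    # Negatives the scorer rates most highly are treated as the hardest.
    negatives.sort(key=lambda i: scores[i], reverse=True)
    return negatives[:k]


def triplet_loss(scores: Sequence[float], positive_idx: int,
                 negative_indices: Sequence[int], margin: float = 0.2) -> float:
    """Sum of hinge terms max(0, margin - s_pos + s_neg) over the given negatives.

    The margin of 0.2 is an assumed value, not one taken from the thesis.
    """
    s_pos = scores[positive_idx]
    return sum(max(0.0, margin - s_pos + scores[j]) for j in negative_indices)


def recall_at_k(scores: Sequence[float], positive_idx: int, k: int) -> float:
    """Return 1.0 if the positive region is among the k top-scoring regions, else 0.0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if positive_idx in ranked[:k] else 0.0


if __name__ == "__main__":
    # Toy scores for five image regions; region 0 is the assumed ground truth.
    toy_scores = [0.9, 0.2, 0.75, 0.4, 0.85]
    hard = mine_hard_negatives(toy_scores, positive_idx=0, k=2)
    print("hard negatives:", hard)
    print("triplet loss:", triplet_loss(toy_scores, 0, hard))
    print("R@1:", recall_at_k(toy_scores, 0, 1))
```

Because `recall_at_k` is a step function of the scores, it has no useful gradient, which is the reason the abstract gives for optimizing an R@K-style reward with policy gradient rather than backpropagating through the metric directly.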
Keywords/Search Tags: Computer Vision, Image-Text Matching, Open Vocabulary, Hard Negative Mining, Reinforcement Learning