
Research and Applications of Image-Text Multimodal Correlation Learning

Posted on: 2019-10-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y L Bai
Full Text: PDF
GTID: 1368330590972973
Subject: Computer application technology
Abstract/Summary:
Multimodal correlation learning is the backbone of multimedia understanding tasks and has wide application scenarios, but it is also challenging because of the gap between the representations of different input modalities. The key to multimodal correlation learning is modeling the correspondence between modalities. This dissertation targets two fundamental research topics in image-text multimodal correlation learning, multimodal data association and multimodal correlation feature learning, together with two related applications, cross-modal image retrieval and multimodal visual question answering.

First, this dissertation proposes a framework that enlarges human-labeled image datasets from large-scale web images through multimodal data association, exploiting the complementarity of images and their surrounding text. The method leverages both the Web and deep convolutional neural networks (DCNNs): the Web provides massive images with rich contextual information, and a DCNN replaces human annotators by labeling images automatically under the guidance of that contextual information. Experiments show that the method can scale existing datasets up significantly from billions of web pages with high accuracy and diversity. Performance on object recognition and transfer learning tasks improves significantly when the automatically augmented datasets are used, demonstrating that additional supervisory information has been gathered from the Web.

Second, a novel deep convolutional neural network for learning image-text multimodal correlation features is introduced. The network learns high-level image representations and word representations jointly in one common continuous space. A cross convolutional filter regularizer is also proposed to accelerate training, cutting training time roughly in half. To validate the quality of the learned features, the image-text correlation features are used to define word-word and image-word similarities for image dataset construction. These two similarities automate the two labor-intensive steps of manual dataset construction, query formation and noisy image removal, and new image datasets are then built from scratch. Beyond scale, the automatically constructed datasets match manually labeled datasets in accuracy, diversity, and cross-dataset generalization.

The dissertation then turns to two applications of multimodal correlation learning. For cross-modal image retrieval, three frameworks are introduced. The first is a canonical correlation analysis (CCA) based retrieval model, in which text representations learned from a corpus and image representations learned from a labeled image recognition task are projected into one common feature space by simple linear transformations. The second is a multi-task retrieval model, featuring a novel training method for multi-task deep convolutional neural networks that uses noisy user click data as supervision to learn image features better suited to the retrieval task. The third is a retrieval model based on the multimodal correlation features, which measures the relevance between a query and images and then re-ranks the search results according to image-to-image similarity. Experimental results on a large-scale open image retrieval task demonstrate the effectiveness of this correlation-feature-based retrieval method.
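To make the first retrieval framework concrete, below is a minimal sketch of CCA-based cross-modal retrieval. It assumes precomputed image features (e.g. DCNN activations) and text features (e.g. corpus-trained embeddings); the feature dimensions, the scikit-learn CCA solver, and the cosine-similarity ranking rule are illustrative assumptions, not the dissertation's exact pipeline.

```python
# Minimal sketch of CCA-based cross-modal retrieval (illustrative only;
# feature extractors, dimensions, and the ranking rule are assumptions).
import numpy as np
from sklearn.cross_decomposition import CCA

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Row-wise L2 normalization so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Hypothetical paired training data: row i of each matrix describes the
# same image-text pair (e.g. 2048-d DCNN features, 300-d text embeddings).
rng = np.random.default_rng(0)
train_img = rng.normal(size=(1000, 2048))
train_txt = rng.normal(size=(1000, 300))

# Learn linear projections of both modalities into a shared 32-d space.
cca = CCA(n_components=32, max_iter=1000)
cca.fit(train_img, train_txt)

# Project paired test data; in the standard evaluation protocol each text
# row is a query and each image row is a gallery item.
test_img = rng.normal(size=(200, 2048))
test_txt = rng.normal(size=(200, 300))
img_c, txt_c = cca.transform(test_img, test_txt)
img_c, txt_c = l2_normalize(img_c), l2_normalize(txt_c)

sims = txt_c @ img_c.T                 # (n_queries, n_gallery) cosine scores
ranking = np.argsort(-sims, axis=1)    # best-matching images per text query
```

The appeal of this design, as the abstract notes, is its simplicity: once both modalities are mapped by linear transformations into one space, retrieval reduces to nearest-neighbor search.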
The second application is multimodal visual question answering (VQA), regarded as one of the most challenging tasks in the multimedia area: it requires not only understanding image details and question semantics, but also reasoning over the relations within image-question-answer triplets. This thesis proposes a regression-based method that measures the correlation degree of image-question-answer triplets. Moreover, an attention-based neural tensor network is proposed for reasoning over the relationships among image-question-answer triplets and for learning high-level associations between question-image representations and answer representations. The framework is applied to two widely used VQA methods, MLB and MUTAN, and the experimental results show that it significantly boosts both.

In summary, this dissertation introduces several solutions for image-text multimodal correlation learning and demonstrates their value on two fundamental tasks, multimodal data association and multimodal correlation feature learning. It also improves performance on two important multimodal applications, image retrieval and visual question answering, confirming the practical value of this work.
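As a rough illustration of the triplet-scoring idea, the sketch below implements a neural-tensor-network-style regressor over a fused question-image vector and an answer embedding. The fusion input, layer sizes, and the sigmoid regression head are assumptions for illustration; the thesis's attention mechanism and its exact integration with MLB/MUTAN fusion are not reproduced here.

```python
# Hedged sketch of a neural-tensor-network-style scorer for
# image-question-answer triplets. Dimensions, the fused "qi" input, and
# the sigmoid regression head are illustrative assumptions.
import torch
import torch.nn as nn

class TripletScorer(nn.Module):
    def __init__(self, dim: int = 512, slices: int = 8):
        super().__init__()
        # One (dim x dim) bilinear matrix per tensor slice.
        self.W = nn.Parameter(torch.randn(slices, dim, dim) * 0.01)
        self.V = nn.Linear(2 * dim, slices)   # linear term on [qi; ans]
        self.u = nn.Linear(slices, 1)         # combines slice activations

    def forward(self, qi: torch.Tensor, ans: torch.Tensor) -> torch.Tensor:
        # qi:  fused question-image representation, shape (batch, dim)
        # ans: answer embedding,                    shape (batch, dim)
        bilinear = torch.einsum('bd,kde,be->bk', qi, self.W, ans)
        h = torch.tanh(bilinear + self.V(torch.cat([qi, ans], dim=-1)))
        return torch.sigmoid(self.u(h)).squeeze(-1)  # correlation in [0, 1]

# Regression-style training step on hypothetical tensors: label 1.0 for a
# correct answer, 0.0 for a sampled incorrect one.
model = TripletScorer()
loss_fn = nn.MSELoss()
qi, ans = torch.randn(32, 512), torch.randn(32, 512)
labels = torch.randint(0, 2, (32,)).float()
loss = loss_fn(model(qi, ans), labels)
loss.backward()
```

Treating answer selection as regression over triplet correlation, rather than multi-class classification over a fixed answer vocabulary, is what lets such a scorer wrap around existing fusion models like MLB and MUTAN.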
Keywords/Search Tags:Multimodal Learning, Image Recognition, Image Retrieval, Visual Question Answering, Deep Convolutional Neural Network, Deep Neural Tensor Network, Attention Model