
Research On Multimodal Training For Image And Text Retrieval

Posted on: 2017-09-03    Degree: Master    Type: Thesis
Country: China    Candidate: G L Zhang    Full Text: PDF
GTID: 2348330518993367    Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years, with the rapid development of networks and computer technology and the rise of data sharing, people are faced with massive growth of text and image data. How to effectively retrieve the data we are interested in has become a prominent problem. Image retrieval techniques include Text-Based Image Retrieval (TBIR) and Content-Based Image Retrieval (CBIR). TBIR requires manual annotation of image data to obtain correct tags and contents. Social data contributed by users are labeled data, which makes it possible to obtain a large amount of semantically annotated data; however, such text data are noisy and incomplete. With the use of deep learning, CBIR has developed greatly, but its retrieval precision is unsatisfactory because of the semantic gap between visual features and semantic concepts. In addition, it is difficult for users to find a picture similar to the query one. Because of these problems, achieving cross-modal retrieval with a direct correlation between text and image is a core underlying demand. Based on this demand, and combining mature models of text and image, this thesis is dedicated to the research of multimodal models that fuse text and image features. The main work of this thesis is as follows:

1. With a Gaussian Restricted Boltzmann Machine as the input model for image features and a Replicated Softmax Model as the input model for text features, we model the relationship between image and text through a joint representation and conduct retrieval experiments on the Flickr8k dataset (the standard energy functions of the two input models are given after this abstract). Because this model can reconstruct the features of the data, on large datasets it is faster than models that must calculate the similarity between image features and text features through an index.

2. We use a dependency tree to analyze the structure of sentences and word embeddings to represent the features of words. Following the structure of the dependency tree, we implement a Recursive Neural Network to model the relations between words. Finally, we construct a ranking cost function to train the parameters of the model (a sketch of such a loss appears below). Retrieval experiments show that this model beats the Multimodal Deep Boltzmann Machine (M-DBM) in retrieval accuracy.

3. Because a single fixed-size representation cannot describe a complex image or sentence, we propose a method to learn the relation between image and text at a fine-grained level. A Convolutional Neural Network (CNN) extracts fine-grained image features from the objects after Regions with CNN (R-CNN) locates the object positions in the image. Word embeddings, the fine-grained features of text, are fed through a Bidirectional Recurrent Neural Network so that they carry context information. With the fine-grained features of text and image, we define a ranking cost function to train the multimodal model (a sketch of a fine-grained matching score also appears below). Owing to the long training time, we implement the model on Caffe. Retrieval experiments show that the model with fine-grained features beats the other models in retrieval accuracy.

4. From another point of view, we build a multimodal model that retrieves images with a natural language model. In the Log-Bilinear Language model (LBL), a transformed image feature acts as a bias that shifts the predicted probability of the next word (sketched below). Although retrieval with this model is slow, the multimodal LBL beats M-DBM in retrieval accuracy.

In conclusion, through the work above we establish relations between visual features and semantic concepts and achieve bidirectional retrieval between images and natural language descriptions.
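For reference, the two input models named in item 1 are standard building blocks from the RBM literature; the energy functions below are the usual textbook forms, not formulas taken from the thesis itself.

```latex
% Gaussian RBM over real-valued image features v, hidden units h:
E(\mathbf{v},\mathbf{h}) = \sum_i \frac{(v_i-b_i)^2}{2\sigma_i^2}
  - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j - \sum_j a_j h_j

% Replicated Softmax over word counts \hat{v}_k in a document of length D:
E(\mathbf{V},\mathbf{h}) = -\sum_{j,k} W_{jk}\,\hat{v}_k h_j
  - \sum_k b_k \hat{v}_k - D\sum_j a_j h_j
```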
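Items 2 and 3 both train with a ranking cost function whose exact form the abstract does not give. Below is a minimal sketch of the common bidirectional max-margin ranking loss for image-sentence retrieval; the margin value and the dot-product similarity are illustrative assumptions, not the thesis's actual choices.

```python
import torch

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional max-margin ranking loss (illustrative sketch).

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings whose
    i-th rows form a matching image-sentence pair.
    """
    scores = img_emb @ txt_emb.t()        # (batch, batch) similarity matrix
    pos = scores.diag().view(-1, 1)       # matching-pair scores

    # hinge: every mismatched pair should score at least `margin`
    # below its matching pair, in both retrieval directions
    cost_txt = (margin + scores - pos).clamp(min=0)      # image -> text
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # text -> image

    # do not penalize the diagonal (the matching pairs themselves)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_txt.sum() + cost_img.sum()
```

Minimizing this pushes each matching pair above every mismatched pair in both retrieval directions, which is what a bidirectional image-text retrieval model needs.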
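The fine-grained model of item 3 must score an image-sentence pair from R-CNN region features and BRNN word features. The abstract does not state the matching rule; one common choice, assumed here, aligns each word with its best-matching region and sums the resulting similarities.

```python
import numpy as np

def alignment_score(regions, words):
    """Fine-grained image-sentence score (illustrative assumption).

    regions: (n_regions, dim) R-CNN region embeddings
    words:   (n_words, dim) BRNN word embeddings in the same space
    """
    sims = regions @ words.T        # (n_regions, n_words) dot products
    # each word votes for its best-matching region; the pair's score
    # is the sum of those best matches
    return sims.max(axis=0).sum()
```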
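Item 4 states that a transformed image feature enters the LBL as a bias on the next-word prediction. The sketch below shows that mechanism; the parameter names and shapes are assumptions for illustration, not the thesis's actual notation.

```python
import numpy as np

def mlbl_next_word_probs(context_ids, img_feat, R, C, M, b):
    """Multimodal Log-Bilinear next-word distribution (sketch).

    context_ids: indices of the n-1 context words
    img_feat:    image feature vector (e.g. a CNN descriptor)
    R: (vocab, dim) word representation matrix
    C: (n-1, dim, dim) per-position context matrices
    M: (dim, img_dim) projects the image feature into word space
    b: (vocab,) per-word bias
    """
    # standard LBL prediction: weighted sum of context representations
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    # multimodal part: the transformed image feature acts as an
    # additive bias on the predicted representation
    r_hat = r_hat + M @ img_feat
    logits = R @ r_hat + b              # score every vocabulary word
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()
```

Because the image shifts the predicted representation rather than entering a joint energy, scoring a sentence against an image means one forward pass per word, which matches the abstract's remark that retrieval with this model is slow.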
Keywords/Search Tags: multimodal, retrieval, deep learning