
Cross-modal Retrieval Using Deep Neural Network

Posted on: 2018-06-20    Degree: Master    Type: Thesis
Country: China    Candidate: X S Meng    Full Text: PDF
GTID: 2348330512994714    Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the adoption of deep learning and intensive research on multi-modality, Question Answering (QA) has been extended from traditional text-based QA to Visual QA (VQA), which incorporates images, and VQA has become a hot research topic in computer vision and natural language processing. Existing VQA methods typically extract image features and text features separately and fuse the two to predict the answer. However, such methods generally ignore the spatial distribution of the image and cannot reasonably model the relationship between the image's spatial layout and the text. Building on this cross-modal study of images and text in VQA, this thesis proposes a new model, the Spatial Discrete Cosine Transform Hash (Spatial-DCTHash) network, which combines question and image features to predict the answer. In addition, against the background of the cross-modal VQA study, this thesis investigates the cross-modal signboard retrieval problem and, drawing on deep learning methods widely used in computer vision, proposes a signboard (brand) retrieval algorithm that handles multi-angle and multi-modal information.

This thesis focuses on cross-modal QA and retrieval algorithms; the main contributions are as follows:

1. A fully convolutional method is proposed to extract image features that preserve the spatial distribution of the image without adding network parameters. At the same time, the Spatial-DCTHash dynamic parameter network dynamically combines the question features with the image features, so that answer prediction takes full account of the local spatial information of the image (see the sketch after this abstract).

2. A multi-angle, multi-information signboard dataset is presented (about 2,400 shops and about 23,000 pictures in total), containing multiple pictures of each store together with the store's GPS coordinates, store name, street name, and other ancillary information.

3. Combining the image features and text features of signboards, the proposed CMR-Net model uses multi-modal information to identify brand signs, so it can complete brand retrieval tasks in different environments and meets commercial requirements.

Finally, experiments on visual-cue data and public datasets (MSCOCO-VQA, COCO-QA) show that the proposed method achieves higher accuracy than previous ones. The algorithm is also compared with several common algorithms on the collected signboard data, and the results show that the cross-modal signboard retrieval model performs better.
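The abstract does not spell out the internals of the Spatial-DCTHash network, but the general idea of a question-conditioned dynamic parameter layer applied to a spatial feature map can be illustrated. The following is a minimal PyTorch sketch under stated assumptions: the question embedding predicts a small pool of parameters, a fixed hash mapping expands them into the weights of a 1x1 convolution applied across the spatial image feature map, and the pooled result predicts the answer. All dimensions, module names, and the hash-based expansion are illustrative assumptions, not the thesis' exact Spatial-DCTHash formulation.

```python
# Hypothetical sketch of a question-conditioned dynamic parameter layer.
# It is NOT the thesis' Spatial-DCTHash network; it only illustrates the
# mechanism of predicting per-sample 1x1 conv weights from the question
# and applying them to a spatial image feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicHashConvVQA(nn.Module):
    def __init__(self, q_dim=1024, img_channels=512, hidden=256,
                 n_hash=4096, n_answers=1000):
        super().__init__()
        # Predict a small pool of candidate parameters from the question.
        self.param_predictor = nn.Linear(q_dim, n_hash)
        # Fixed random hash: maps each entry of the per-sample 1x1 conv
        # weight (hidden x img_channels) to one of the n_hash candidates.
        self.register_buffer(
            "hash_idx", torch.randint(0, n_hash, (hidden * img_channels,)))
        self.hidden = hidden
        self.img_channels = img_channels
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, q_feat, img_feat):
        # q_feat: (B, q_dim) question embedding (e.g., from an RNN encoder)
        # img_feat: (B, C, H, W) spatial feature map from a fully
        # convolutional image network (spatial layout preserved).
        B = q_feat.size(0)
        params = self.param_predictor(q_feat)          # (B, n_hash)
        w = params[:, self.hash_idx]                   # (B, hidden * C)
        w = w.view(B, self.hidden, self.img_channels)  # per-sample 1x1 conv
        # Apply the question-conditioned 1x1 convolution at every location.
        x = torch.einsum("boc,bchw->bohw", w, img_feat)
        x = F.relu(x)
        x = x.mean(dim=(2, 3))                         # global average pool
        return self.classifier(x)                      # answer logits
```

In this sketch the hash-based weight sharing keeps the number of question-predicted parameters small while still producing a full weight tensor, which is the same motivation the abstract gives for combining a hashing scheme with dynamically generated parameters.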
Keywords/Search Tags: VQA, signboard retrieval, spatial DCT hash, cross-modal, deep learning