
Research On Object-scene Related Visual Features And Cross-modal Reciprocal Neighbors Based Image-sentence Retrieval

Posted on: 2020-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: M M Jiang
Full Text: PDF
GTID: 2428330602950203
Subject: Signal and Information Processing
Abstract/Summary:
The task of image-sentence retrieval spans two major research fields: computer vision and natural language processing. In recent years, both fields have made significant progress in visual understanding and textual semantic analysis, so image-sentence retrieval has attracted increasing attention and advanced rapidly. However, many challenges remain. For example, image features often fail to reflect the full information contained in an image because they ignore scene context. Besides, the cross-modal neighboring relationship between the visual and semantic sides is asymmetric during cross-modal retrieval. This thesis conducts systematic research on these issues. The main contributions are summarized as follows:

(1) We propose a visual-semantic alignment algorithm based on the fusion of object-related and scene-related deep visual features. In traditional image-sentence retrieval methods, visual features are extracted by CNN-based object classifiers trained on the ImageNet dataset; they therefore contain rich object information but lack scene context. To ensure that the visual features carry multiple complementary kinds of semantic information, we fuse object-based and scene-based convolutional neural networks to extract deep visual features. The sentence is then encoded by a long short-term memory network to obtain the corresponding semantic feature representation. Finally, two mapping matrices project the visual and semantic features into a common cross-modal embedding space that is better suited to image-sentence retrieval. We evaluate the alignment model on the MSCOCO and Flickr30k datasets. The results show that the proposed cross-modal alignment algorithm outperforms the previous VSE++ model under various metrics, which demonstrates that our visual features contain richer semantic information. In other words, our model achieves better visual-semantic alignment and constructs a better cross-modal embedding space.

(2) We propose a re-ranking method for image-sentence retrieval based on cross-modal reciprocal nearest neighbors. Because a caption carries highly condensed semantic information while visual features carry more raw and abundant semantics, the neighboring relationship in cross-modal retrieval is asymmetric, which reduces retrieval accuracy. We apply a cross-modal reciprocal nearest neighbor method to re-rank the initial search list in the MVSE++ embedding space, lessening the asymmetry of the nearest-neighbor relationship in the cross-modal embedding space, thereby reducing retrieval errors and improving the performance of image-sentence retrieval. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed re-ranking method based on cross-modal reciprocal nearest neighbors significantly improves the final image-sentence retrieval results and effectively alleviates the asymmetry of the search relationship in the cross-modal embedding space.
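The fusion-and-projection pipeline described in contribution (1) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the feature dimensions, the concatenation-based fusion, and the random stand-in mapping matrices are all assumptions; in the actual model the matrices would be learned jointly with a ranking loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 2048-d object features (e.g. from an ImageNet CNN),
# 2048-d scene features (e.g. from a scene-classification CNN), 1024-d LSTM
# sentence codes, and a 512-d joint embedding space. All sizes are illustrative.
d_obj, d_scene, d_sent, d_joint = 2048, 2048, 1024, 512

def l2norm(x):
    """Row-wise L2 normalization, standard before cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Fuse the two visual streams by concatenation (one common fusion choice).
obj_feat = rng.normal(size=(5, d_obj))      # object-CNN features, 5 images
scene_feat = rng.normal(size=(5, d_scene))  # scene-CNN features, 5 images
visual = np.concatenate([obj_feat, scene_feat], axis=1)

sent = rng.normal(size=(5, d_sent))         # LSTM-encoded sentence features

# Two mapping matrices (random stand-ins for learned parameters) project
# both modalities into the shared cross-modal embedding space.
W_v = rng.normal(size=(d_obj + d_scene, d_joint)) / np.sqrt(d_obj + d_scene)
W_s = rng.normal(size=(d_sent, d_joint)) / np.sqrt(d_sent)

v_emb = l2norm(visual @ W_v)
s_emb = l2norm(sent @ W_s)

# Cosine similarity matrix used to rank sentences per image (and vice versa).
sim = v_emb @ s_emb.T
print(sim.shape)  # (5, 5)
```

With L2-normalized embeddings, the dot product is exactly cosine similarity, so retrieval in both directions reduces to sorting the rows or columns of `sim`.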
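The reciprocal-nearest-neighbor idea in contribution (2) can be sketched with a simple symmetric-match bonus: sentence j is promoted for image i only when i is also among the top-k images retrieved by j. This is a hedged illustration of the general technique, not the thesis's exact re-ranking rule; `k` and `bonus` are hypothetical hyper-parameters.

```python
import numpy as np

def rerank_reciprocal(sim, k=3, bonus=0.5):
    """Re-rank an image-to-sentence similarity matrix by rewarding
    cross-modal reciprocal nearest neighbors.

    sim   : (n_img, n_sent) similarity matrix from the embedding space
    k     : neighborhood size on each side (illustrative value)
    bonus : additive score boost for reciprocal matches (illustrative value)
    """
    new_sim = sim.copy()
    top_sent = np.argsort(-sim, axis=1)[:, :k]  # top-k sentences per image
    top_img = np.argsort(-sim, axis=0)[:k, :]   # top-k images per sentence
    for i in range(sim.shape[0]):
        for j in top_sent[i]:
            if i in top_img[:, j]:         # the relationship is reciprocal
                new_sim[i, j] += bonus     # promote the symmetric match
    return new_sim

# Toy example: sentence 1 ranks image 0 above image 1, so the pair (1, 1)
# is not reciprocal under k=1 and receives no boost, while (0, 0) does.
sim = np.array([[0.9, 0.8],
                [0.1, 0.7]])
boosted = rerank_reciprocal(sim, k=1, bonus=0.5)
print(boosted[0, 0])  # 1.4
```

Because only pairs whose neighborhood relationship holds in both directions are boosted, asymmetric matches keep their original scores and tend to drop in the re-ranked list.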
Keywords/Search Tags:Cross-modal retrieval, embedding space, feature fusion, reciprocal nearest neighbor, re-rank