
Research On Image-text Retrieval Method Based On Deep Learning

Posted on: 2021-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2518306548494084
Subject: Control Science and Engineering
Abstract/Summary:
In this era of rapid development of computers and communications, people are exposed to more and more multimedia information such as text, video, audio, and images. Through the Internet, people are gradually achieving global sharing of multimedia information, user queries over multimedia data have become increasingly common, and a variety of new application requirements have followed. Cross-media retrieval refers to a multimedia retrieval method that can flexibly move between modalities, that is, retrieve samples of other modalities related to a given instance of one modality. Such search results are rich in content and present query objects to users from more perspectives.

This paper focuses on cross-modal retrieval between the image and text modalities. A deep learning model extracts feature representations of the images and texts in the dataset, maps both into a high-dimensional common subspace, and measures the similarity between samples of the two modalities by their distance in that subspace to complete retrieval. This paper proposes a multi-level feature extraction method and a dual semantic space construction method: the feature extraction stage extracts image and text features that are conducive to fusion, while the feature fusion stage constructs a real semantic space and a transformed semantic space for each modality and combines them for retrieval, effectively improving retrieval performance. The main work and research results of this paper include the following aspects:

(1) Addressing the problem of semantic alignment in image-text retrieval, this paper improves the feature extraction part of existing retrieval models and proposes a multi-level key semantic information extraction method. The retrieval method consists of three modules: the first module adds dilated convolution to the VGG network and to Text-CNN to obtain multi-level features of images and text; the second module achieves semantic alignment by selecting and combining features through an attention mechanism and an outer product; the third module fuses the two modalities and maps them into a common subspace for retrieval.

(2) This paper proposes a dual semantic space retrieval model. In current feature fusion networks, the objective function combines a classification task and a fusion task. Because the feature space of each modality must be classifiable while also accommodating the feature distribution of the other modality, the finally learned feature space loses accuracy and fails to fit either distribution well, which degrades cross-modal search results. This paper first builds a real semantic space, that is, a complete semantic space that identifies single-modality labels well. It then constructs a transformed semantic space, which acts as a bridge between the two modalities' real semantic spaces, carrying its own modality's semantics together with the feature distribution of the modality to be retrieved. For retrieval, each modality's transformed-space feature is compared with the other modality's real-space feature, the similarities are computed, and the results are combined to complete the retrieval.
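The common-subspace retrieval idea above can be sketched as follows. The feature dimensions, the random projection matrices `W_img` and `W_txt`, and the `l2_normalize` helper are illustrative assumptions standing in for the thesis's learned networks; the sketch only shows how ranking by distance in a shared subspace works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features; in the thesis these would come from
# the VGG-based image branch and the Text-CNN text branch.
image_feats = rng.normal(size=(5, 512))   # 5 images, 512-d features
text_feats = rng.normal(size=(5, 300))    # 5 captions, 300-d features

# Linear projections into a shared 128-d subspace (random for illustration;
# in practice these would be learned jointly with the fusion objective).
W_img = rng.normal(size=(512, 128))
W_txt = rng.normal(size=(300, 128))

def l2_normalize(x):
    """Normalize rows so that a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img_emb = l2_normalize(image_feats @ W_img)
txt_emb = l2_normalize(text_feats @ W_txt)

# Text-to-image retrieval: rank all images by similarity to each query text.
similarity = txt_emb @ img_emb.T            # (5 texts, 5 images)
ranking = np.argsort(-similarity, axis=1)   # best-matching image first
```

With learned projections, matched image-text pairs would land close together in the subspace, so the correct image would appear early in each query's ranking.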
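The attention-and-outer-product alignment in module two can be illustrated with a minimal sketch. The region and word counts, the shared 64-d projection, and the softmax pooling are assumptions for illustration, not the thesis's exact architecture; the sketch shows one common way pairwise affinities (an outer-product-style interaction) feed an attention step that aligns text context to each image region.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical features: 7 image regions and 9 words, both already
# projected into a shared 64-d space by earlier layers.
regions = rng.normal(size=(7, 64))
words = rng.normal(size=(9, 64))

# Pairwise region-word affinities (outer-product-style interaction matrix).
affinity = regions @ words.T                  # (7, 9)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Attention over words for each region: each region attends to the words
# most related to it, which performs the feature selection and combination.
attn = softmax(affinity, axis=1)              # rows sum to 1
attended_text = attn @ words                  # (7, 64) text context per region

# Fused representation: each region concatenated with its aligned text context.
fused = np.concatenate([regions, attended_text], axis=1)  # (7, 128)
```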
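The dual semantic space scoring rule described in (2) can be sketched as follows. The random features, the 64-d size, the `cosine` helper, and the simple averaging of the two directions are illustrative assumptions; the thesis's actual spaces are learned, and its method of synthesizing the two similarity results may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 64

# Hypothetical learned features for one query image and 4 candidate texts.
# Each modality carries a "real" feature (its own semantic space) and a
# "transformed" feature (mapped toward the other modality's space).
img_real = rng.normal(size=dim)
img_trans = rng.normal(size=dim)
txt_real = rng.normal(size=(4, dim))
txt_trans = rng.normal(size=(4, dim))

def cosine(M, v):
    """Cosine similarity between each row of M and the vector v."""
    return (M @ v) / (np.linalg.norm(M, axis=1) * np.linalg.norm(v))

# Compare each modality's transformed-space feature with the other
# modality's real-space feature, then combine the two directions.
score_i2t = cosine(txt_real, img_trans)   # image's transformed vs texts' real
score_t2i = cosine(txt_trans, img_real)   # texts' transformed vs image's real
final = (score_i2t + score_t2i) / 2       # synthesized retrieval score
best = int(np.argmax(final))              # index of the best-matching text
```

Scoring both directions lets each modality's classifier stay faithful to its own label distribution while the transformed space carries the cross-modal comparison, which is the motivation the abstract gives for the dual-space design.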
Keywords/Search Tags:Deep learning, Convolutional neural network, Cross-media retrieval, Feature extraction, Feature fusion