In the Internet era, people generate massive amounts of multimedia data every day through various applications, including voice, short video, images, and text. This has created a need for diverse retrieval services, such as image-text retrieval and audio-video retrieval. To meet these needs by providing better retrieval services, researchers have focused on the theory, methods, and practice of cross-modal retrieval, which therefore has a wide range of application scenarios and significant research value. How to mine effective information from multi-modal data is an important issue in multi-modal data research. Our research consists of the following three aspects.

(1) To address the problem that existing shallow network structures cannot adequately model high-level semantic correlation between modalities, we propose a cross-modal image-text retrieval method based on a stacked bimodal autoencoder. A deep network structure is adopted to mine high-level inter-modal semantic correlation, and layer-wise pre-training is introduced to enhance the learning ability of the model. Although data of different modalities are heterogeneous at the bottom layers, they share rich semantic association information at the top layer, so the deeper network structure further improves the accuracy of cross-modal image-text retrieval. Extensive experiments on three popular cross-modal datasets demonstrate that the improved model outperforms the best benchmark model in accuracy, improving mean average precision by 4%, 8.6%, and 3.7% on the three datasets, respectively.

(2) To address the problem that most current cross-modal image-text retrieval methods ignore inter-modal correlation in their first stage, we propose a hybrid deep neural network that integrates multiple neural networks: a multi-modal deep belief network, a stacked autoencoder, and a correspondence autoencoder. By combining these networks, we establish multi-level correspondence relations to mine fine-grained features and multi-level correlations of multi-modal data. Extensive experiments on three popular cross-modal datasets demonstrate that the improved model outperforms the best benchmark model in accuracy, improving mean average precision by 5.7%, 13.2%, and 5.2% on the three datasets, respectively.

(3) To address the problem that traditional image processing methods generalize poorly and are unsuitable for large-scale data, we adopt VGGNet to optimize the cross-modal image-text retrieval model. Because deep convolutional neural networks perform very well on image data, we build on the two models above and use the representative deep convolutional neural network VGGNet to extract the image features of the experimental datasets. We complete comparative experiments against the traditional image processing methods on multiple cross-modal image-text retrieval datasets.
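As an illustration of the stacked bimodal autoencoder in (1), the following is a minimal PyTorch sketch. The layer sizes, the mean-squared-error reconstruction objective, and the greedy layer-wise pre-training loop are illustrative assumptions, not the exact configuration used in the experiments.

```python
# Minimal sketch of a stacked bimodal autoencoder with layer-wise pre-training
# (dimensions and training schedule are assumed, not taken from the experiments).
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Encodes image and text features into a shared code and reconstructs both."""
    def __init__(self, img_dim=4096, txt_dim=1000, hidden=1024, shared=256):
        super().__init__()
        # Modality-specific encoders: the bottom layers are heterogeneous.
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        # Shared top layer that captures inter-modal semantic correlation.
        self.shared_enc = nn.Sequential(nn.Linear(2 * hidden, shared), nn.ReLU())
        self.shared_dec = nn.Sequential(nn.Linear(shared, 2 * hidden), nn.ReLU())
        # Modality-specific decoders for reconstruction.
        self.img_dec = nn.Linear(hidden, img_dim)
        self.txt_dec = nn.Linear(hidden, txt_dim)

    def forward(self, img, txt):
        h = torch.cat([self.img_enc(img), self.txt_enc(txt)], dim=1)
        code = self.shared_enc(h)
        h_img, h_txt = self.shared_dec(code).chunk(2, dim=1)
        return code, self.img_dec(h_img), self.txt_dec(h_txt)

def layerwise_pretrain(layers, data, epochs=5, lr=1e-3):
    """Greedy layer-wise pre-training: each layer is trained as a small
    autoencoder on the output of the previously trained layers."""
    x = data
    for layer in layers:
        out_dim = layer[0].out_features            # first module is the Linear layer
        decoder = nn.Linear(out_dim, x.shape[1])   # temporary decoder for this layer
        opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(decoder(layer(x)), x)
            loss.backward()
            opt.step()
        x = layer(x).detach()                      # input for the next layer
    return x

# Usage sketch with random stand-in data (real features come from the datasets).
imgs, txts = torch.randn(64, 4096), torch.randn(64, 1000)
model = BimodalAutoencoder()
h_img = layerwise_pretrain([model.img_enc], imgs)
h_txt = layerwise_pretrain([model.txt_enc], txts)
layerwise_pretrain([model.shared_enc], torch.cat([h_img, h_txt], dim=1))
# ...followed by end-to-end fine-tuning of the full model on the reconstruction loss.
```

After pre-training, fine-tuning the whole stack on the joint reconstruction loss lets the shared top layer refine the inter-modal correlation that the greedily trained layers provide as initialization.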
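As an illustration of the VGGNet feature extraction in (3), below is a minimal sketch using torchvision's pre-trained VGG. The choice of VGG-19, the layer whose activation is taken as the feature, and the preprocessing values are assumptions rather than the exact setup of the experiments.

```python
# Minimal sketch: extracting image features with a pre-trained VGG network
# (torchvision's VGG-19 is assumed here; the experimental setup may differ).
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing expected by the pre-trained VGG weights.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg.eval()
# Drop the final classification layer so the 4096-d fully connected activation is returned.
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

@torch.no_grad()
def extract_feature(path: str) -> torch.Tensor:
    """Returns a 4096-dimensional VGG feature vector for one image file."""
    img = Image.open(path).convert("RGB")
    x = preprocess(img).unsqueeze(0)   # add batch dimension
    return vgg(x).squeeze(0)           # shape: (4096,)

# Example with a hypothetical file name: feat = extract_feature("example.jpg")
```

The extracted vectors would then replace the hand-crafted image features fed into the retrieval models of (1) and (2), which is the comparison the experiments in (3) carry out.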