This paper considers the problem of cross-modal retrieval, e.g., using a text query to search for images and vice versa. Its purpose is to obtain a more effective cross-modal retrieval model through structural modification and improved algorithms, focusing mainly on "image-text" cross-modal retrieval problems.

The key point of cross-modal retrieval is how to model the correspondence between data from different modalities and establish a common space for joint feature representation. Ideally, one could establish a semantic space for the representations from the different modalities, which can be seen as a space of feature fusion; the problem of how to establish such a space is called multi-modal feature fusion in this paper.

The starting point of this paper is to take advantage of deep learning technology to reform the classical cross-modal retrieval model and thereby improve retrieval performance. There are two main contributions. The first is a deep model for cross-modal retrieval that uses an existing "shallow" model, the correspondence auto-encoder (Corr-AE), as its basic building block. The second is a new feature-fusion algorithm, on which we base a novel cross-modal retrieval model composed of the correspondence auto-encoder and canonical correlation analysis (CCA). A cross-modal retrieval system based on the new Corr-AE+CCA model is also presented at the end of the paper.
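To make the common-space idea concrete, the sketch below fits a linear CCA on toy "image" and "text" feature matrices and projects both into a shared space where paired samples are maximally correlated. This is a minimal illustrative implementation in NumPy, not the paper's Corr-AE+CCA model; the variable names and the synthetic data are assumptions for demonstration only.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-6):
    """Fit linear CCA: return projections mapping X and Y into a shared
    k-dimensional space where corresponding pairs are maximally correlated
    (the "common space" used for cross-modal retrieval)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices.
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S is SPD).
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Sxx_i, Syy_i = inv_sqrt(Sxx), inv_sqrt(Syy)
    # SVD of the whitened cross-covariance yields the canonical directions.
    U, s, Vt = np.linalg.svd(Sxx_i @ Sxy @ Syy_i)
    return Sxx_i @ U[:, :k], Syy_i @ Vt[:k].T

# Hypothetical "image" and "text" features sharing a latent semantic signal.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))                        # shared semantics
X = z @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))
Y = z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
Wx, Wy = cca(X, Y, k=2)
u, v = (X - X.mean(0)) @ Wx, (Y - Y.mean(0)) @ Wy
corr = np.corrcoef(u[:, 0], v[:, 0])[0, 1]           # first canonical correlation
```

In a retrieval setting, nearest-neighbor search between the projected representations `u` and `v` would then match queries from one modality against items from the other.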