Semantically Coherent Cross-Modal Correlation Learning And Information Retrieval

Posted on: 2016-12-08 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Y Hua | Full Text: PDF
GTID: 1108330482460427 | Subject: Communication and Information System
Abstract/Summary:
With the popularization of the Internet, the proliferation of multimedia data and the diversification of user demands have created an urgent need for effective information retrieval. Large-scale Internet multimedia data have complicated semantic structures and diversified contents. Moreover, multimedia data such as texts, images, and videos come from heterogeneous modalities, and the relations between them are complicated. This makes multimedia data analysis and retrieval a challenging problem.

Traditional text-based retrieval techniques compare the similarity between a textual query and the text surrounding web multimedia data, so they suffer from the mismatch between textual descriptions and the actual multimedia contents. Annotation-based retrieval techniques rely on a risky intermediate semantic annotation process, which usually trains a classifier for each semantic category. The accuracy of these classifiers is restricted by the semantic gap between low-level features and high-level semantics. In addition, the relations among semantic categories are complicated, so it is hard for annotation-based methods to enable effective multimedia retrieval. Recently, researchers have studied the correlations between heterogeneous modalities, which is a goal-oriented solution for cross-modal retrieval. However, existing methods do not adapt to the diversified contents and complicated semantic relations present in abundant multimedia data, and thus cannot handle the heterogeneous feature spaces of the different modalities well. In this dissertation, we address the information retrieval problem through semantically coherent cross-modal correlation learning.

As the first contribution, we investigate semantic-instructed visual attention and construct a model for extracting salient regions. Visual information is more redundant than textual information with respect to the expression of high-level semantics, and the human visual system extracts the primary visual information through selective attention. Visual attention influenced by example images and predefined targets has been widely studied in both cognitive science and computer vision, but semantics, which are related to high-level human perception, also strongly influence the top-down attention process. We collect fixations with eye-movement tracking while subjects view videos under semantic instructions. Analysis of the fixation behaviour shows that the process of semantic-instructed attention can be explained by long-term and short-term memory. Inspired by this finding, we propose a memory-guided probabilistic saliency detection model that combines top-down and bottom-up modules for semantic-instructed saliency. Experimental results show that our model achieves significant improvements in predicting semantic-instructed visual salient regions.
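As a rough illustration of how a top-down (memory-driven) module and a bottom-up module can be combined probabilistically, consider the following minimal Python sketch. It is not the dissertation's model: the fusion rule, the weight, and the random input maps are hypothetical stand-ins for maps that a real system would derive from semantic memory and low-level image contrast.

```python
"""Minimal sketch: probabilistic fusion of top-down and bottom-up saliency.

Illustrative only; `fuse_saliency` and `top_down_weight` are hypothetical
names, and the random inputs stand in for maps derived from semantic
memory (top-down) and image contrast (bottom-up).
"""
import numpy as np

def _normalize(m: np.ndarray) -> np.ndarray:
    # Rescale a map to [0, 1] so both modules contribute on the same scale.
    m = m - m.min()
    return m / (m.max() + 1e-8)

def fuse_saliency(bottom_up: np.ndarray,
                  semantic_prior: np.ndarray,
                  top_down_weight: float = 0.6) -> np.ndarray:
    """Convex combination of two saliency maps; the weight controls how
    strongly the semantic instruction biases attention."""
    fused = (top_down_weight * _normalize(semantic_prior)
             + (1.0 - top_down_weight) * _normalize(bottom_up))
    return _normalize(fused)

# Toy usage with random maps in place of real feature-derived ones.
rng = np.random.default_rng(0)
fused = fuse_saliency(rng.random((60, 80)), rng.random((60, 80)))
print(fused.shape, float(fused.min()), float(fused.max()))
```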
The second contribution is a set of semantically coherent cross-modal correlation learning methods. Inspired by dimensionality reduction and metric learning on single-modal data, a goal-oriented solution for cross-modal retrieval is to transform the heterogeneous modalities into measurable low-dimensional representations. However, existing correlation learning methods usually simplify the complicated semantic relations between multimedia data into one-to-one correspondences and single-modality side information, and the global projections employed in previous studies cannot adapt to diverse multimedia contents.

We first propose cross-modal large-margin metric learning with category-level semantic relevance, in which the distances implied by the category-level relevance among cross-modal data are optimized within a regularized learning framework. Second, since complicated semantic relations can be described hierarchically, we study semantically coherent retrieval, where documents from different modalities are ranked by their multi-level semantic relevance to the query. By jointly modeling content and ontology similarities, we build an adaptive semantic hierarchy to measure multi-level semantic relevance. To handle the complicated semantic relations and diversified multimedia contents, we propose localized correlation learning methods with two ways of aggregating the localized projections through probabilistic membership functions. We optimize a well-defined structural risk objective that combines a semantic coherence measurement, local projection consistency, and a complexity penalty on the local projections. With the learned local linear projections and probabilistic membership functions, the distances between cross-modal data reflect their relevance on the semantic hierarchy. Extensive experiments on the widely used NUS-WIDE and ICML-Challenge datasets demonstrate that the proposed methods outperform state-of-the-art methods and adapt better to multi-level semantic relations and content divergence.
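To make the large-margin principle concrete, here is a toy Python/NumPy sketch that learns linear projections U (images) and V (texts) into a shared space so that a relevant image-text pair ends up closer than an irrelevant one by a fixed margin. It is a simplified stand-in for the regularized framework described above; the synthetic data generator, dimensions, learning rate, and plain SGD loop are all hypothetical choices, not the dissertation's formulation.

```python
"""Toy sketch: cross-modal large-margin metric learning.

Learns linear maps U (image side) and V (text side) so that relevant
image-text pairs are closer than irrelevant ones by a margin. The data
generator and hyper-parameters are hypothetical, not the dissertation's.
"""
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_lat, d_common = 64, 32, 8, 16

# Synthetic data: a relevant image-text pair shares a latent vector z;
# an irrelevant text uses an independent latent vector.
A = rng.normal(size=(d_img, d_lat))
B = rng.normal(size=(d_txt, d_lat))

def sample_triplet():
    z, z_neg = rng.normal(size=d_lat), rng.normal(size=d_lat)
    x = A @ z + 0.1 * rng.normal(size=d_img)
    t_pos = B @ z + 0.1 * rng.normal(size=d_txt)
    t_neg = B @ z_neg + 0.1 * rng.normal(size=d_txt)
    # Unit-normalize so step sizes stay well behaved.
    return tuple(v / np.linalg.norm(v) for v in (x, t_pos, t_neg))

U = rng.normal(scale=0.1, size=(d_common, d_img))
V = rng.normal(scale=0.1, size=(d_common, d_txt))
margin, lr, lam = 1.0, 0.01, 1e-4

for _ in range(5000):
    x, t_pos, t_neg = sample_triplet()
    diff_p, diff_n = U @ x - V @ t_pos, U @ x - V @ t_neg
    # Hinge loss: the relevant pair must be closer by `margin`.
    if margin + diff_p @ diff_p - diff_n @ diff_n > 0:
        gU = 2 * np.outer(diff_p - diff_n, x)
        gV = 2 * (np.outer(diff_n, t_neg) - np.outer(diff_p, t_pos))
        U -= lr * (gU + lam * U)  # lam: simple complexity penalty
        V -= lr * (gV + lam * V)

# Sanity check: how often is the relevant text now the closer one?
wins = sum(
    np.sum((U @ x - V @ tp) ** 2) < np.sum((U @ x - V @ tn) ** 2)
    for x, tp, tn in (sample_triplet() for _ in range(500)))
print("relevant-closer rate:", wins / 500)
```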
The third contribution is semantic visual feature learning for cross-modal data. Most existing research builds correlation learning models on hand-crafted features for the visual and textual modalities; such models cannot capture the meaningful patterns in complicated visual information and cannot identify the true correlations between modalities during feature learning. Recently, deep networks have attracted many researchers' attention due to their outstanding performance in feature learning. We propose a novel cross-modal correlation learning method with well-designed feature learning on the visual modality. As in a CNN, we build a deep architecture with stacked convolutional, non-linear, pooling, and fully connected layers. A novel cross-modal correlation layer with a linear projection is then added on top of this architecture by maximizing semantic consistency under the large-margin principle. All parameters of the feature representation and the correlation learning are jointly optimized with stochastic gradient descent. Experimental results on the widely used NUS-WIDE dataset show that our deep correlation model outperforms state-of-the-art correlation learning methods built on six hand-crafted visual features for image-text retrieval.

The fourth contribution is a semantic cross-modal retrieval framework. Given large-scale Internet log data for training, we compare and integrate three typical methods (SVM-based, CCA-based, and PAMIR) to measure the relevance of cross-modal data with concept-level visual features. In the SVM-based approach, the relevance of an image to a textual query is scored by an SVM classifier trained online for that query. With CCA, the correlations between images and texts are maximized by learning a pair of linear transformations. PAMIR formalizes retrieval as a ranking problem and introduces a learning procedure that optimizes a ranking-related criterion by projecting images into the textual space. Using the concept-level visual features obtained with a CNN, our output aggregation system achieves promising performance at the MSR-Bing Image Retrieval Challenge at ICME 2014.
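For the CCA-based component, a minimal sketch using scikit-learn's `CCA` is shown below: it learns the pair of linear transformations on placeholder features and ranks images for a textual query by cosine similarity in the shared space. The random feature matrices are hypothetical stand-ins for real concept-level CNN features and query-side text features.

```python
"""Sketch: CCA-based image-text relevance, as in the retrieval framework.

scikit-learn's CCA stands in for "learning a pair of linear
transformations"; the random features are placeholders for real
concept-level CNN features and textual query features.
"""
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n, d_img, d_txt, k = 200, 50, 30, 10

img_feats = rng.normal(size=(n, d_img))  # placeholder image features
txt_feats = rng.normal(size=(n, d_txt))  # placeholder text features

# Learn the two linear transformations that maximize correlation.
cca = CCA(n_components=k)
cca.fit(img_feats, txt_feats)

# Project both modalities into the shared space and rank images for a
# textual query by cosine similarity there.
img_c, txt_c = cca.transform(img_feats, txt_feats)
query = txt_c[0]
scores = (img_c @ query) / (
    np.linalg.norm(img_c, axis=1) * np.linalg.norm(query) + 1e-8)
print("top-5 images for query 0:", np.argsort(-scores)[:5])
```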
Keywords/Search Tags: Information retrieval, Cross-modal correlation learning, Complicated semantics modeling, Model aggregation, Structure learning, Multimedia content analysis