
Cross-Modal Image Clustering

Posted on: 2015-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Zhao
Full Text: PDF
GTID: 2308330503975094
Subject: Computer Science and Technology
Abstract/Summary:
Collaborative image tagging systems such as Flickr have become very popular with people who want to share and search for images. Collaborative tagging is valuable because it supplies keywords that give image seekers an intuitive way to search for images of interest. In such a system, people tag images according to their social or cultural backgrounds, personal expertise, and perception. Collaboratively tagged images are called weakly-tagged images because the relationship between the semantics of an image and its social tags is uncertain. With the exponential growth of weakly-tagged images, it has become increasingly important to have mechanisms that support more effective organization, visualization, search, and summarization of large-scale social image collections.

To tackle these problems, image clustering has emerged as an important application. Most traditional image clustering algorithms are based on low-level visual features. The major drawback of this approach is that even state-of-the-art visual features cannot represent image content at a semantic level. As a result, image clustering suffers from the semantic gap between visual features and high-level semantic concepts. A simple, naive solution adopted by search engines to partly overcome this problem is to treat image clustering as a text clustering problem: Web images are represented by textual features drawn from their surrounding text or social tags. But since images are not text documents, this approach does not really solve the problem.

Recently, some approaches that combine visual and textual features have been proposed. We call this kind of approach cross-modal image clustering. In our opinion, there are two key issues in cross-modal image clustering: first, how to extract enough semantic information from keywords; second, how to combine the semantic and visual information.

We use the lexical database WordNet to capture the semantic similarity between keywords. WordNet is a large lexical database of English whose structure makes it a useful tool for computational linguistics and natural language processing. Various WordNet-based measures are available, such as the Hirst-St-Onge measure, the Lin measure, and the Context Vectors measure. These automatic methods assign relatedness scores to pairs of concepts. With these scores, our system can easily determine that “pencil” is more related to “paper” than it is to “boat”.

We obtain a semantic similarity matrix of the images from a WordNet-based semantic measure, and a visual similarity matrix from low-level features. These two similarity matrices provide different grouping information from different viewpoints. To combine them, we apply Canonical Correlation Analysis (CCA) to determine the optimal projection directions by maximizing the correlation between the semantic and visual similarity spaces. After the semantic and visual similarities are projected onto their most correlated spaces, a cross-modal similarity matrix is computed by directly combining the information from the two modalities. In simple terms, our approach tries to create a common space where the semantic and visual information can talk to each other.
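
As an illustration of the WordNet-based relatedness scoring described above, the following is a minimal sketch of the Lin measure, sim(a, b) = 2 IC(lcs(a, b)) / (IC(a) + IC(b)), where IC is the information content of a concept and lcs is the lowest common subsumer. It does not use the real WordNet: the tiny taxonomy, the corpus counts, and the example tag lists are hypothetical, chosen only to reproduce the pencil/paper/boat example. The way tag scores are averaged into an image-level similarity matrix is also an assumption, since the abstract does not say how the thesis aggregates them.

```python
import math

# Hypothetical toy taxonomy (child -> parent); real WordNet is far larger.
PARENT = {
    "stationery": "entity",
    "vehicle": "entity",
    "pencil": "stationery",
    "paper": "stationery",
    "boat": "vehicle",
}

# Hypothetical corpus frequencies for the leaf concepts.
LEAF_COUNTS = {"pencil": 10, "paper": 30, "boat": 20}


def ancestors(concept):
    """Path from a concept up to the root, starting with the concept itself."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def concept_counts():
    """Propagate leaf counts upward so each concept counts its descendants."""
    totals = {}
    for leaf, n in LEAF_COUNTS.items():
        for node in ancestors(leaf):
            totals[node] = totals.get(node, 0) + n
    return totals


COUNTS = concept_counts()
TOTAL = COUNTS["entity"]  # the root subsumes every occurrence


def information_content(concept):
    """IC(c) = -log p(c), written as log(TOTAL / count); the root has IC 0."""
    return math.log(TOTAL / COUNTS[concept])


def lowest_common_subsumer(a, b):
    """Most specific concept that is an ancestor of (or equal to) both a and b."""
    b_ancestors = set(ancestors(b))
    for node in ancestors(a):
        if node in b_ancestors:
            return node
    raise ValueError("concepts share no ancestor")


def lin_similarity(a, b):
    """Lin measure: 2 * IC(lcs) / (IC(a) + IC(b)), a value in [0, 1]."""
    denom = information_content(a) + information_content(b)
    if denom == 0.0:
        return 1.0  # both concepts are the root
    return 2.0 * information_content(lowest_common_subsumer(a, b)) / denom


def semantic_similarity_matrix(tag_lists):
    """Image-by-image matrix; each entry is the mean Lin score over all tag
    pairs of the two images (an assumed aggregation, not the thesis's)."""
    def image_sim(tags_a, tags_b):
        scores = [lin_similarity(x, y) for x in tags_a for y in tags_b]
        return sum(scores) / len(scores)
    return [[image_sim(a, b) for b in tag_lists] for a in tag_lists]


# Hypothetical tags for three images.
IMAGE_TAGS = [["pencil", "paper"], ["paper"], ["boat"]]

if __name__ == "__main__":
    print("pencil-paper:", lin_similarity("pencil", "paper"))
    print("pencil-boat: ", lin_similarity("pencil", "boat"))
    for row in semantic_similarity_matrix(IMAGE_TAGS):
        print([round(v, 3) for v in row])
```

In this toy setting "pencil" and "paper" share the specific parent "stationery", so their score is positive, while "pencil" and "boat" meet only at the root, whose information content is zero. A matrix of this kind would play the role of the semantic similarity matrix that is later paired with the visual one.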
Keywords/Search Tags: WordNet, Image clustering, Canonical Correlation Analysis (CCA), Computational Linguistics