
Research on Feature Representation Model and Multiple Information Sources Fusion Methods in Image Retrieval

Posted on: 2013-04-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W T Lu
GTID: 1228330374999653
Subject: Signal and Information Processing
Abstract/Summary:
Multimedia information plays an increasingly important role in people's daily activities. However, efficiently and effectively retrieving images that satisfy the needs of web users from large multimedia databases is becoming more and more important and challenging. This thesis studies feature representation models for images and fusion methods for multiple multimedia information sources in image retrieval.

In general, image retrieval falls into two groups: text-based image retrieval (TBIR) and content-based image retrieval (CBIR). Most early text-based systems search by simply matching user-supplied keywords against textual descriptions of images. This requires that each image be labeled with corresponding text beforehand, which remains a difficult problem, and the quality of the labels has a direct impact on retrieval accuracy. Content-based image retrieval consequently gained importance: it focuses on the content of the images themselves, directly extracting low-level visual features and then indexing and retrieving images based on those features.

This thesis first presents a detailed study of image feature representation models. Feature extraction is a crucial step in image retrieval and strongly affects all subsequent stages. Image features can be divided into low-level visual features and high-level semantic features; owing to technical limitations, image retrieval usually relies on low-level visual features to approximate high-level semantic concepts. Low-level visual features comprise global and local features. Compared with global features, most local features are invariant to scale, rotation, translation, affine transformation, and illumination changes, and therefore achieve better performance. Among local features, SIFT (Scale-Invariant Feature Transform) has been widely used for image retrieval, especially in combination with TF-IDF (Term Frequency-Inverse Document Frequency) weighting to form the classic bag-of-words (BoW) model. However, the basic BoW model ignores the spatial information of visual words and captures only limited semantic relationships among them. To address these limitations, we first explore the spatial and semantic relationships between visual words and propose a novel image representation model, the bag-of-phrases (BoP) model, which represents images at both the word level and the phrase level. The BoP model enhances the spatial and semantic discriminative power of image features and handles background clutter effectively.
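For concreteness, the following is a minimal sketch of the classic SIFT + TF-IDF bag-of-words pipeline referred to above, using OpenCV and scikit-learn; the vocabulary size, clustering settings, and IDF smoothing are illustrative assumptions, not the exact configuration used in the thesis.

import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_paths):
    # Extract 128-D local SIFT descriptors from each grayscale image.
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None
                         else np.empty((0, 128), dtype=np.float32))
    return per_image

def bow_tfidf(per_image, vocab_size=1000):
    # Visual vocabulary: k-means over the pooled local descriptors.
    kmeans = KMeans(n_clusters=vocab_size, n_init=10)
    kmeans.fit(np.vstack(per_image))
    # Term frequency: histogram of visual-word assignments per image.
    tf = np.array([np.bincount(kmeans.predict(d), minlength=vocab_size)
                   if len(d) else np.zeros(vocab_size)
                   for d in per_image], dtype=float)
    # Inverse document frequency over the whole image collection
    # (smoothed; the exact smoothing is an assumption).
    df = np.count_nonzero(tf > 0, axis=0)
    idf = np.log((1.0 + len(per_image)) / (1.0 + df)) + 1.0
    return tf * idf  # one TF-IDF-weighted BoW vector per image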
The other focus of our research is the fusion of multiple multimedia information sources in image retrieval, i.e., how to combine other information sources associated with web images, such as text, video, and audio, to improve retrieval performance. In this context, web image clustering/categorization, as a crucial step of image retrieval, has an important impact on the accuracy and performance of the subsequent retrieval.

Therefore, we first provide a comparative experimental study of five classic and well-accepted clustering/classification methods: two single-modal methods (text-based and image-based) and three multi-view learning methods (feature integration, semantic integration, and kernel integration). The comparison shows that the single-modal methods yield relatively low performance on web image clustering/categorization, whereas integrating the text and image sources with multi-view learning improves performance dramatically. However, all three multi-view approaches process each information source separately and then combine them at the feature, semantic, or kernel level, ignoring the correlation and interaction between the sources. We therefore explore the feasibility of using text information as "guidance" for image categorization and propose two novel methods, Dynamic Weighting and Region-based Semantic Concept Integration, which outperform the five existing methods. To make these two classification methods scale to large datasets, we further propose a novel multimedia information fusion framework that integrates them seamlessly by analyzing the characteristics of individual images. The framework not only chooses a suitable classification model for each test image according to its characteristics, achieving better performance with less computation time on large-scale datasets, but also handles the case where the textual descriptions of a small portion of web images are missing.

Beyond the interaction between images and texts, we further investigate the correlation between web images and their textual descriptions at the semantic level and use it to enrich the feature space in which supervised classification is performed, a form of transfer learning. We propose a cross-domain transfer learning method that exploits web multimedia objects without true labels in supervised classification tasks. Experiments show that, by transferring such correlation knowledge, the method not only handles large-scale web multimedia objects through effective multimedia information fusion, but also copes with the situation where one information source is missing for a small portion of the objects.
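As an illustration of two of the multi-view baselines compared above, the sketch below shows feature-level integration (one classifier on concatenated text and visual vectors) and kernel-level integration (a weighted sum of per-source kernels). The SVM classifier, RBF kernels, and weight w are assumptions chosen for illustration; they do not reproduce the thesis experiments or the two proposed methods.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def feature_integration(text_vecs, visual_vecs, labels):
    # Feature level: train a single classifier on the concatenated
    # text + visual representation of each image.
    fused = np.hstack([text_vecs, visual_vecs])
    return SVC(kernel="rbf").fit(fused, labels)

def kernel_integration(text_vecs, visual_vecs, labels, w=0.5):
    # Kernel level: one kernel per information source, combined by a
    # weighted sum; the SVM is then trained on the fused kernel matrix.
    # (Prediction requires the fused test-vs-train kernel as well.)
    K = w * rbf_kernel(text_vecs) + (1.0 - w) * rbf_kernel(visual_vecs)
    return SVC(kernel="precomputed").fit(K, labels)

Semantic integration, the third baseline, would instead combine per-source predictions or concept scores rather than features or kernels.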
Keywords/Search Tags: Image Retrieval, Information Fusion, Image Feature Representation, Transfer Learning, Bag-of-Words Model