With the popularity of smartphones, digital cameras, and other digital devices, and the rapid development of Internet technologies, the number of images on the World Wide Web has increased dramatically. Representing and understanding the visual content of these images has therefore become a great challenge. Due to the "semantic gap" problem, traditional unsupervised visual feature extraction methods cannot provide sufficient information for image understanding tasks. To address this problem, this thesis focuses on supervised visual features and aims to incorporate high-level semantics into the visual content. The thesis presents our work on the following aspects.

In the research on interest-point-level visual features, we propose a simple and efficient approach to refine local descriptors for vector quantization by embedding semantic information. The original local descriptors are projected into a new feature space using a sequence of supervised bases. The transformed descriptors are then quantized and aggregated into the visual vocabulary. Experimental comparisons with several state-of-the-art approaches on an object categorization dataset demonstrate the effectiveness of the proposed approach.

Traditionally, the Bag-of-Visual-Words model describes an image as a histogram of the occurrence rates of codebook vocabulary. We propose to incorporate the spatial correlogram between codewords to approximate local geometric information. This works by augmenting the traditional vocabulary histogram with the distance distribution of pairwise interest regions. We also combine this correlogram representation with semantic feature selection.
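The histogram baseline and the correlogram augmentation described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the binning scheme, and the simplification of keeping one distance distribution per codeword (rather than per codeword pair) are assumptions made here for brevity.

```python
import numpy as np

def bovw_histogram(assignments, vocab_size):
    """Standard Bag-of-Visual-Words: normalized histogram of codeword
    occurrences over the interest points of one image."""
    hist = np.bincount(assignments, minlength=vocab_size).astype(float)
    return hist / max(hist.sum(), 1.0)

def correlogram(assignments, positions, vocab_size, dist_bins):
    """Distance distribution of pairwise interest regions: for every pair
    of interest points, bin their spatial distance and accumulate the
    count under each point's codeword (a per-codeword simplification)."""
    n_bins = len(dist_bins) - 1
    corr = np.zeros((vocab_size, n_bins))
    n = len(assignments)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(positions[i] - positions[j])
            b = np.searchsorted(dist_bins, d) - 1
            if 0 <= b < n_bins:
                corr[assignments[i], b] += 1
                corr[assignments[j], b] += 1
    total = corr.sum()
    if total > 0:
        corr /= total
    # The augmented image representation concatenates bovw_histogram(...)
    # with this flattened correlogram.
    return corr.ravel()
```

Concatenating the two vectors yields a representation that retains the occurrence statistics of the plain histogram while adding coarse local geometry.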
Experimental results show that the correlogram representation can outperform the histogram scheme for the Bag-of-Visual-Words model, and that the combination with semantic feature selection further improves categorization performance.

In the research on regional attribute features, we introduce a novel image representation in which an image is described based on the response maps of object part filters. Our proposed representation, called Hybrid-Parts, is generated by pooling the response maps of the hybrid filters. In contrast to previous approaches that adopted object-level detections as feature inputs, we harness the filter responses of object parts, which enables a richer and finer-grained representation. Through experiments on several scene recognition benchmarks, we demonstrate that Hybrid-Parts outperforms recent state-of-the-art methods, and that combining it with standard low-level features such as the GIST descriptor leads to further improvements.

In the research on global attribute features, we introduce a novel framework based on constructing context spaces of global attributes. Both pairwise attribute correlations and a context space representing attribute themes trained from a set of attributes are presented. Experiments on the TRECVID benchmark and comparisons with several state-of-the-art approaches show the effectiveness of the proposed framework.
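The pooling step that turns part-filter response maps into an image descriptor, as used in representations of the Hybrid-Parts kind, can be sketched as below. This is a hedged illustration under assumed choices (max pooling over a regular spatial grid, one map per filter); the thesis pipeline may differ in its pooling operator and spatial layout.

```python
import numpy as np

def pool_response_maps(response_maps, grid=2):
    """Max-pool each part-filter response map over a grid x grid spatial
    partition and concatenate the cell maxima into one feature vector."""
    feats = []
    for rmap in response_maps:
        h, w = rmap.shape
        ys = np.linspace(0, h, grid + 1, dtype=int)
        xs = np.linspace(0, w, grid + 1, dtype=int)
        for i in range(grid):
            for j in range(grid):
                cell = rmap[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                # Each cell contributes its strongest filter response.
                feats.append(cell.max() if cell.size else 0.0)
    return np.array(feats)
```

The resulting vector has `len(response_maps) * grid * grid` dimensions, so richer part-filter banks directly yield finer-grained descriptors.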