In today’s information age, universal access to the internet and mobile terminals has spawned massive amounts of multimedia data. Images, with their intuitive and informative characteristics, have become the mainstream medium of internet information dissemination. Extracting high-level representations from image data has long been a central concern of computer vision and is one of the fundamental problems in artificial intelligence. Over the last decades, benefiting from the prosperity of deep learning, research on image recognition and retrieval has achieved unprecedented breakthroughs and plays an important role in a wide range of commercial applications. Hand-drawn input has been widely accepted with the rapid development of touch-screen technology, and freehand sketching offers incomparable convenience in human-computer interaction. Sketch-based image retrieval therefore has strong application prospects and has attracted increasing attention in broad research fields.

Mainstream deep learning methods rely on abundant image data and rich annotations to train fully supervised models, and remarkable performance is achieved on ideal benchmark datasets. However, in practical applications, many uncertainties exist in model training and deployment, including data scarcity due to long-tailed distributions, the semantic gap among massive object categories, and asymmetric representation between modalities. These practical issues pose a severe test to the practicability of deep models. In particular, sketch-based image retrieval is a cross-modal learning task, which faces more complex challenges when studied in the open world. The complex uncertainties of the open world are analyzed and summarized under three major machine learning paradigms, namely weakly supervised learning, zero-shot learning, and semi-supervised learning. The sketch-based image retrieval task can be decomposed into several key recognition sub-problems, including object detection, semantic segmentation, and cross-modal retrieval. Each sub-problem is
studied under its typical open-world conditions, and a series of algorithms and solutions are proposed for weakly supervised object detection, scene sketch semantic segmentation, and zero-shot sketch-based image retrieval. Existing sketch-based image retrieval models are mostly implemented at the single-instance level, which limits their applications. Therefore, generalized scene-level retrieval is further explored. For the first time, a scene-level sketch-based image retrieval model is proposed under open-world conditions by incorporating the above achievements. The main contents and contributions are summarized as follows:

(1) A subimage-based weakly supervised object detection model is proposed. The model is built upon a novel segmentation-aggregation learning framework. It introduces the normalized cut criterion from graph theory to split each input image into several subimages, each of which is expected to contain one dominant object, thereby helping distinguish clustered objects. During the aggregation of subimage results, contextual perception is regularized by the devised dynamic proposal dropping strategy to improve detection integrity. Experiments on the PASCAL VOC 2007 benchmark show that the proposed model surpasses all existing competitors on both detection and localization tasks and effectively addresses two kinds of long-standing difficult scenarios.

(2) A scene sketch semantic segmentation model with enhanced detail perception is proposed. In view of the lack of relevant datasets, two large-scale scene-sketch datasets with realistic hand-drawing styles are first contributed. Considering the sparsity of the sketch domain, fine-grained local representations are extracted from the initial layers of the deep model to supplement high-level semantics with local details, and the stroke mask is integrated into end-to-end training to guide stroke-level segmentation. In addition, a sketch-specific segmentation post-processing algorithm is proposed based on the local continuity of strokes. Experiments on the public
SketchyScene dataset and the contributed SKY-Scene and TUB-Scene datasets show that the proposed model holds clear advantages over competitors under multi-dimensional evaluations and significantly improves the precision of stroke-level semantic segmentation.

(3) A semi-transductive zero-shot learning paradigm is proposed to realize more generalized instance-level sketch-based image retrieval. In sketch-based image retrieval, data from the natural image modality are easily accessible and relatively abundant, while freehand sketches of the sketch modality are much scarcer. It is therefore reasonable to introduce transductive semi-supervised learning into the image modality, so that abundant unlabeled images can be utilized to explore the model’s potential adaptation to unseen classes. The learned knowledge is transferred from the image modality to the sketch modality through the proposed discriminative semantic alignment and distribution calibration modules. Extensive experiments on SketchyExtended and TU-Berlin-Extended show that the proposed method significantly advances the state-of-the-art zero-shot retrieval performance.

(4) A conditional graph autoencoder structure is designed to realize scene-level sketch-based image retrieval. The model integrates object extractors from both the image and sketch domains to extract regions of interest, and the detected foreground objects are used to convert images into layout graphs according to their scene layouts. The encoder is a mixed graph convolutional network with parallel static and dynamic branches, which encodes structural and visual representations with global contextual information. The decoder variant is constructed as a weakly supervised multi-label classifier, which maintains the nodes’ semantic representations through reconstruction consistency. The model requires only pairing information to realize scene-level retrieval, without detailed instance-level annotations. Experiments on the SketchyScene dataset show that the proposed method is far superior to existing state-of-the-art
solutions, including those that adopt more training supervision.
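To make the normalized-cut splitting in contribution (1) concrete, the following is a minimal, self-contained sketch of spectral graph bipartition via the standard relaxation of the normalized-cut criterion (thresholding the second eigenvector of the symmetric normalized Laplacian). It is an illustrative toy, not the thesis's actual subimage segmentation code; the affinity matrix and threshold rule here are assumptions for demonstration.

```python
import numpy as np

def normalized_cut_bipartition(W):
    """Split a graph with symmetric affinity matrix W into two groups by
    the relaxed normalized-cut criterion: threshold the eigenvector of the
    second-smallest eigenvalue of L = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]               # second-smallest eigenvector
    return fiedler >= np.median(fiedler)  # boolean group assignment

# Toy demo: two 3-node clusters with weak cross-cluster affinity.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
labels = normalized_cut_bipartition(W)
```

In the thesis's setting, graph nodes would correspond to image regions and the affinities to appearance similarity, with the recursion stopped once each subimage is expected to contain a single dominant object.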