| Image representation is fundamental to how an image is described. Different types of representation present the image from different aspects. For example, pixel-based representations directly describe each pixel by its color, texture, or brightness, providing the low-level features of images, while region-based representations assign each region a meaningful label, focusing on high-level semantics. In a sense, image processing and computer vision can be regarded as extracting desirable features and transforming raw images into different representations. Recent deep convolutional neural networks learn to tackle different vision tasks under supervision given in various forms of image representation. However, we find that traditional image representations for dense semantic prediction tasks usually neglect the spatial relations between pixels, implicitly losing structural and geometrical information. In this paper, we address this problem with a relation-based direction field representation. We transform the traditional image representations into the proposed form and thus force the network to learn the spatial relations between pixels directly, highlighting structural and geometrical information in feature learning. To validate the effectiveness of the proposed relation-based direction field representation, we propose concrete solutions and conduct experiments on two vision tasks, object skeletonization and scene text detection, both of which are closely tied to the structural and geometrical properties of images. A skeleton is a structure-based object descriptor that reveals local symmetry as well as connectivity between object parts. Object skeletonization in natural images is challenging, owing to large variations in object appearance and scale and to the complexity of handling background clutter. Existing learning-based methods frame this task as a binary pixel classification problem, which is similar in spirit to learning-based edge detection and to semantic segmentation methods. In this paper, we take full advantage of the relation-based direction field representation and propose a novel “skeleton context flux”, which maps each context point to a skeleton pixel, in the spirit of flux-based object skeletonization algorithms. The skeleton context flux has two major advantages over previous approaches. First, it encodes the relative position of skeletal pixels with respect to semantically meaningful entities, such as the image points in their spatial context, and hence also the implied object boundaries. Second, since the skeleton context flux is a region-based direction field, it copes better with object parts of large width. We then present a novel method named “Deep Flux” for accurately localizing the object skeleton. We evaluate the proposed method on five object skeletonization datasets, where it consistently outperforms the state-of-the-art methods of the time. Scene text detection is an important step in scene text reading. The main challenges lie in the widely varying sizes and aspect ratios and the arbitrary orientations and shapes of text. Driven by recent progress in deep learning, impressive performance has been achieved for multi-oriented text detection. Yet performance drops dramatically on curved text because of limited text representations (e.g., horizontal bounding boxes, rotated rectangles, quadrilaterals, or binary masks). Detecting curved text is of great importance, since it is very common in natural scenes. In this paper, we again take full advantage of the relation-based direction field representation and propose a novel “text direction field”, which points away from the nearest text boundary toward each text point. It encodes not only the binary text mask but also structural and geometrical information, which can further be used to separate adjacent text instances. We then present a novel method named “Text Field” for detecting arbitrarily shaped scene text. Extensive experimental results show that the proposed method outperformed the state-of-the-art methods of the time by a large margin on two curved text datasets, and also achieved very competitive performance on two multi-oriented text datasets. Furthermore, the proposed method generalizes robustly to unseen datasets. |
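
The “text direction field” mentioned above can be made concrete with a small sketch. The abstract only states that the field points away from the nearest text boundary toward each text point. The code below therefore rests on assumptions that are not taken from the source: the nearest non-text pixel stands in for the nearest boundary point, each vector is normalized to unit length, non-text pixels receive the zero vector, and the function name `direction_field` is hypothetical. It uses a brute-force search that is only practical for tiny masks.

```python
import math


def direction_field(mask):
    """Illustrative direction field for a binary mask (rows of 0/1).

    Assumptions (not from the source): the nearest non-text pixel is used
    as the boundary reference, vectors are unit length, and non-text
    pixels get (0.0, 0.0). Vectors are returned as (dx, dy).
    """
    h, w = len(mask), len(mask[0])
    # Coordinates (x, y) of all non-text pixels.
    background = [(x, y) for y in range(h) for x in range(w) if not mask[y][x]]
    field = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or not background:
                continue  # non-text pixel, or no boundary reference exists
            # Brute-force nearest non-text pixel (squared Euclidean distance).
            bx, by = min(background,
                         key=lambda b: (b[0] - x) ** 2 + (b[1] - y) ** 2)
            dx, dy = x - bx, y - by
            norm = math.hypot(dx, dy)  # > 0 because (x, y) is a text pixel
            # Unit vector pointing from the nearest non-text pixel to (x, y).
            field[y][x] = (dx / norm, dy / norm)
    return field


# A one-row mask with two text pixels in the middle.
mask = [[0, 1, 1, 0, 0]]
field = direction_field(mask)
# field[0][1] is (1.0, 0.0): its nearest non-text pixel lies to the left.
# field[0][2] is (-1.0, 0.0): its nearest non-text pixel lies to the right.
```

In this sketch the binary mask is still recoverable as the set of pixels with a nonzero vector, and the two text pixels receive opposing directions. This illustrates how such a field carries geometric information beyond the mask itself.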