
Image Feature Understanding And Semantic Representation Based On Deep Learning

Posted on: 2020-05-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G J Yin
Full Text: PDF
GTID: 1368330572979007
Subject: Information and Communication Engineering
Abstract/Summary:
The field of image feature understanding and semantic representation is an interdisciplinary subject involving computer vision, natural language processing, machine learning, and related areas. It is also an important part of the intelligentization of computer vision and can be applied in many areas of life and industry. Image feature understanding and semantic representation is built on deep neural networks, whose most important characteristic is layer-by-layer iteration and progression: a network model is constructed from basic visual feature maps up to high-level semantic descriptions, and a natural language description is then generated from the semantic content of the image. The semantic information contained in the image is thereby translated into natural language text.

The key issues studied in this thesis are: (1) how to learn and understand the visual relationships between semantic regions in an image, and solve the problem that targets cannot be effectively recognized amid complex visual relationships; (2) how to learn a natural-language description generation model for a target region, and solve the problem that the descriptions generated by a naive module are inaccurate and lack richness; (3) how to integrate the semantic information of natural language to generate the corresponding visual content, and solve the problem that the semantics contained in the natural language and in the generated image are inconsistent.

To solve these problems, the main contributions and innovations of this thesis are as follows:

1. We propose a novel visual relationship recognition network framework combining spatial location, context information, and appearance features. For large-scale visual relationship recognition, this approach introduces a novel visual feature network structure, the Spatiality-Context-Appearance module (SCA-M), which adopts relative spatial positions and context information for deeper and more comprehensive visual feature learning. Moreover, the proposed deROI pooling operation pools local object features into the corresponding regions of the global predicate feature map; in practice, deROI pooling can be regarded as an inverse of traditional ROI pooling, analogous to deconvolution versus convolution (see the sketch following contribution 2 below). Furthermore, to mitigate label ambiguity in large-scale datasets and enhance recognizability, an Intra-Hierarchical tree (IH-tree) is introduced that reformulates visual relationship recognition as a multi-label recognition problem.

2. We propose a novel dense image captioning network based on context information and linguistic attribute losses. The network contains a novel Contextual Feature Extractor (CFE), which establishes a non-local similarity graph for feature interaction between the target ROI and its neighboring ROIs based on their feature affinity and spatial nearness. The CFE allows contextual information from multiple adjacent ROIs (i.e., the global image and the neighbors) to be adaptively shared with the target ROI, as sketched below. To reinforce the coarse-to-fine structure of description generation, we adopt coarse-level and fine-level linguistic attribute losses as additional supervision at the respective sequential LSTM cells. Freed from the sequential restrictions of the ground-truth captions, such keywords or attributes are more recognizable from the content of the target ROI, and thus have a more stable discriminative power for the extraction of visual patterns.
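The abstract describes deROI pooling only at the conceptual level. The following is a minimal PyTorch sketch of the idea from contribution 1; the function name deroi_pool, the bilinear resizing, and the argument layout are illustrative assumptions, not the dissertation's actual implementation.

```python
import torch
import torch.nn.functional as F

def deroi_pool(local_feat, roi, out_size):
    """Place a fixed-size local object feature back into the region of a
    global feature map covered by its ROI -- the inverse of ROI pooling,
    analogous to deconvolution versus convolution.

    local_feat: (C, h, w) pooled feature of one object
    roi:        (x1, y1, x2, y2) box in feature-map coordinates
    out_size:   (H, W) spatial size of the global predicate feature map
    """
    C = local_feat.shape[0]
    H, W = out_size
    x1, y1, x2, y2 = (int(v) for v in roi)
    # Clamp the box so it stays inside the global map.
    x1, y1 = max(0, min(x1, W - 1)), max(0, min(y1, H - 1))
    rh, rw = max(min(y2, H) - y1, 1), max(min(x2, W) - x1, 1)
    # Resize the fixed-size local feature to the ROI's spatial extent.
    resized = F.interpolate(local_feat[None], size=(rh, rw),
                            mode='bilinear', align_corners=False)[0]
    # Scatter it into an otherwise empty global canvas at the ROI location.
    canvas = local_feat.new_zeros(C, H, W)
    canvas[:, y1:y1 + rh, x1:x1 + rw] = resized
    return canvas
```

The inverse relationship to ROI pooling is visible here: instead of cropping a region and shrinking it to a fixed size, a fixed-size feature is expanded and written back to the region it came from.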
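For the Contextual Feature Extractor of contribution 2, the sketch below shows one plausible reading of a non-local similarity graph built from feature affinity and spatial nearness. The dot-product affinity, the Gaussian nearness term, their product inside a softmax, and the residual fusion are all assumptions made for illustration, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def contextual_feature(target, neighbors, centers, target_center, sigma=1.0):
    """Aggregate context from neighboring ROIs into a target ROI feature
    via a non-local similarity graph whose edge weights combine feature
    affinity with spatial nearness.

    target:        (D,) feature of the target ROI
    neighbors:     (N, D) features of neighboring ROIs (one row may be
                   a global-image feature)
    centers:       (N, 2) box centers of the neighbors
    target_center: (2,) box center of the target ROI
    """
    # Feature affinity between the target and each neighbor.
    affinity = neighbors @ target                      # (N,)
    # Spatial nearness: a Gaussian on the center distance.
    dist = (centers - target_center).pow(2).sum(-1)    # (N,)
    nearness = torch.exp(-dist / (2 * sigma ** 2))     # (N,)
    # Normalized graph edge weights.
    weights = F.softmax(affinity * nearness, dim=0)    # (N,)
    # Residual fusion of the aggregated context into the target feature.
    context = weights @ neighbors                      # (D,)
    return target + context
```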
3. We propose a novel photo-realistic text-to-image generation model that implicitly disentangles semantics to fulfill both high-level semantic consistency and low-level semantic diversity. To address the deviated image generation caused by variations among descriptions, the proposed Semantics Disentangling Generative Adversarial Network (SD-GAN) distills the semantic commons from texts for consistent image generation while retaining the semantic diversities and details for fine-grained generation. Specifically, SD-GAN uses a Siamese scheme that takes a pair of texts as input and is trained with a contrastive loss. To some extent, the Siamese structure does distill the semantic commons from the texts, but it also ignores the semantic diversities and details of the descriptions, even when they describe the same image. To maintain the semantic diversities of the texts, the detailed linguistic cues are embedded into the visual generation by reformulating the batch normalization layers within the generator, denoted as Semantic-Conditioned Batch Normalization (SCBN). SCBN enables the detailed and fine-grained linguistic embedding to manipulate the visual feature maps in the generative networks.
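For the Siamese scheme in contribution 3, a contrastive loss over paired text embeddings has the shape below. The abstract does not give the exact formulation, so this uses the conventional margin-based form; the function name siamese_contrastive_loss and the margin default are hypothetical.

```python
import torch
import torch.nn.functional as F

def siamese_contrastive_loss(f1, f2, same_image, margin=1.0):
    """Contrastive loss for a Siamese pair of text embeddings: pull
    embeddings of two captions of the same image together, push
    captions of different images at least `margin` apart.

    f1, f2:     (B, D) embeddings of a pair of text descriptions
    same_image: (B,) 1.0 if the pair describes the same image, else 0.0
    """
    d = F.pairwise_distance(f1, f2)                    # (B,)
    pos = same_image * d.pow(2)                        # pull pairs together
    neg = (1 - same_image) * F.relu(margin - d).pow(2) # push pairs apart
    return 0.5 * (pos + neg).mean()
```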
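SCBN is described as a batch normalization layer reformulated to carry linguistic cues. The sketch below follows the common conditional batch-norm pattern (normalize without learned affine parameters, then predict the scale and shift from a text embedding); the linear projections to_gamma/to_beta and the (1 + gamma) centering are assumptions, not the dissertation's stated design.

```python
import torch
import torch.nn as nn

class SCBN(nn.Module):
    """Semantic-Conditioned Batch Normalization (sketch): a batch-norm
    layer whose scale and shift are predicted from a linguistic
    embedding, letting text cues modulate visual feature maps."""

    def __init__(self, num_channels, text_dim):
        super().__init__()
        # Normalize without learned affine parameters; the affine
        # transform is supplied by the text condition instead.
        self.bn = nn.BatchNorm2d(num_channels, affine=False)
        self.to_gamma = nn.Linear(text_dim, num_channels)
        self.to_beta = nn.Linear(text_dim, num_channels)

    def forward(self, x, text_emb):
        # x: (B, C, H, W) visual feature maps
        # text_emb: (B, text_dim) linguistic embedding
        h = self.bn(x)
        gamma = self.to_gamma(text_emb).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(text_emb).unsqueeze(-1).unsqueeze(-1)
        # Center gamma at 1 so an uninformative text leaves h unchanged.
        return (1 + gamma) * h + beta
```

Placed inside the generator, such a layer lets each sentence or word embedding reweight feature channels at every normalization step, which matches the abstract's claim that fine-grained linguistic cues manipulate the visual feature maps.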
Keywords/Search Tags:Visual Relationship Recognition, Dense Image Captioning, Text-to-Image Generation, Generative Adversarial Networks, Image Captioning