With the rapid growth of image data in real life, computer vision-based image analysis has attracted the attention of many researchers in the field of artificial intelligence. Visual relationship understanding is an important branch of image analysis that aims to explore the relationships among the contents of images. It is one of the basic technologies required in many practical scenarios, such as clothing recommendation, chatbots, and autonomous driving. The images that require visual relationship understanding vary widely. According to the number of objects they contain, images can be divided into two types: simple images and complex images. Simple images usually contain a single object against a clean background, while complex images tend to contain multiple objects and have more cluttered backgrounds. As different types of images carry different amounts of information, the specific requirements for visual relationship understanding also differ. In particular, for simple images we usually focus on inter-image visual relationship understanding, for example, understanding the compatibility relationship between clothing images in the clothing recommendation task. For complex images, we lean toward intra-image visual relationship understanding, including the understanding of the spatial and semantic relationships among the objects in an image, for example, understanding the spatial relationships among objects in autonomous driving applications. Therefore, to adapt to specific practical applications, it is necessary to analyze the visual relationships of images of different complexities from different aspects. Toward this end, this work systematically studies visual relationship understanding from the inter-image and intra-image aspects, based on the practical application tasks of inter-image compatibility relationship modeling and
intra-image scene graph generation, respectively. Although existing works have made promising progress on these two tasks, several problems remain. In particular, existing works on inter-image compatibility relationship modeling rely excessively on labeled data, neglect domain knowledge, and have poor interpretability, making it difficult for them to satisfy users' requirements. As for existing works on intra-image scene graph generation, their ability to distinguish similar predicate classes is insufficient, and some of them are overly debiased against head predicate predictions. These problems make the generated scene graphs less informative and difficult to apply in practice.

To tackle the aforementioned problems, this work explores visual relationship understanding from the following aspects:

1) Knowledge-enhanced inter-image visual relationship understanding. This work explores the guidance of probabilistic domain knowledge for deep neural networks, and a probabilistic knowledge distillation model for inter-image compatibility relationship modeling is proposed. The proposed model automatically extracts and structures massive domain knowledge and assigns knowledge confidences according to co-occurrence probabilities, based on which it guides and enhances the deep neural network through a knowledge distillation framework. The proposed model is able to reduce the deep neural network's dependence on labeled data and improve the performance of compatibility relationship modeling. In addition, this work explores how different kinds of domain knowledge guide deep neural networks differently.

2) Prototype-guided interpretable inter-image visual relationship understanding. This work introduces a new way to enhance model interpretability by mining latent prototypes, and a prototype-guided interpretable scheme for inter-image compatibility relationship modeling
is proposed. The proposed model first obtains a dimension-level interpretable semantic attribute representation of each clothing image via an attribute classifier. Then it excavates latent attribute interaction prototypes using Nonnegative Matrix Factorization, based on which it interprets the compatibility relationship and provides alternative clothing items. Besides, the proposed model integrates compatibility relationship modeling and the excavation of attribute interaction prototypes through the Bayesian Personalized Ranking algorithm so that the two schemes promote each other.

3) Divide-and-conquer strategy for intra-image visual relationship understanding. This work tackles the predicate prediction task by exploiting the correlation among predicates and introduces a divide-and-conquer network for intra-image visual relationship understanding. The proposed model discovers the similarity relations among predicates through a pattern-predicate correlation mining algorithm and then divides predicate prediction into multiple subtasks. It distinguishes the subtle differences among similar predicates with multiple dedicated predicate classifiers and uses a Bayesian Personalized Ranking scheme to promote the pairwise distinguishing of head predicates and their similar tail predicates. Together, the two schemes facilitate distinguishing the subtle differences among similar predicates.

4) Dual-biased complementary predictor for intra-image visual relationship understanding. This work investigates a strategy for integrating the advantages of biased and unbiased intra-image scene graph generation models and proposes a dual-biased predicate predictor for intra-image scene graph generation. The proposed method devises a head-oriented soft regularization that exploits the biased predicate predictor's superior prediction of head predicates to compensate for the poor head predicate
prediction in the unbiased predicate predictor, which alleviates the overly debiased problem and achieves a better trade-off between head and tail predicate predictions.
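The prototype-mining step in contribution 2 rests on Nonnegative Matrix Factorization. As a minimal sketch (matrix shapes, the toy data, and the plain multiplicative-update solver are illustrative assumptions, not the thesis's actual implementation): stacking nonnegative attribute representations into a matrix V and factorizing V ≈ WH yields rows of H that can be read as latent attribute-interaction prototypes, with rows of W as per-sample prototype weights.

```python
import numpy as np

def nmf_prototypes(V, k, iters=200, eps=1e-9, seed=0):
    """Factorize a nonnegative matrix V (n_samples x n_attrs) as V ~ W @ H
    using Lee-Seung multiplicative updates. Rows of H act as latent
    attribute-interaction prototypes; rows of W are per-sample weights."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        # multiplicative updates keep W and H nonnegative by construction
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy attribute matrix: 6 items, 4 attribute dimensions, 2 latent prototypes
V = np.random.default_rng(1).random((6, 4))
W, H = nmf_prototypes(V, k=2)
print(np.linalg.norm(V - W @ H))  # reconstruction error shrinks as k grows
```

Nonnegativity is what makes the prototypes readable: each item is an additive mixture of prototypes, with no cancelling negative parts.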
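Contributions 2 and 3 both lean on the Bayesian Personalized Ranking objective. Its standard pairwise form can be sketched as follows (the scores here are placeholder numbers; in the thesis they would come from the compatibility or predicate models): BPR minimizes −log σ(s_pos − s_neg), which pushes each positive score above its paired negative score.

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """Bayesian Personalized Ranking loss: -mean log sigmoid(s_pos - s_neg).
    Minimizing it encourages every positive to outscore its paired negative."""
    diff = np.asarray(pos_scores, dtype=float) - np.asarray(neg_scores, dtype=float)
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-diff)))))

# correctly ranked pairs yield a small loss; inverted pairs yield a large one
well_ranked = bpr_loss([3.0, 2.5], [0.0, -1.0])
mis_ranked = bpr_loss([0.0, -1.0], [3.0, 2.5])
print(well_ranked, mis_ranked)  # well_ranked < mis_ranked
```

Because the loss depends only on score differences, it optimizes ranking order directly rather than absolute score values, which is why it suits both compatibility modeling and head-versus-tail predicate distinguishing.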
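One plausible reading of contribution 4's head-oriented soft regularization, sketched here purely as an illustrative guess (the function name, the temperature, the KL form, and the head mask are all assumptions, not the thesis's exact formulation): a distillation-style KL term that pulls the unbiased predictor's distribution over head predicate classes toward the biased predictor's softened distribution, while leaving tail classes unconstrained.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head_soft_regularizer(unbiased_logits, biased_logits, head_mask, tau=2.0):
    """Illustrative head-oriented soft regularization: mean KL divergence
    between the biased predictor's softened head-class distribution (teacher)
    and the unbiased predictor's head-class distribution (student).
    Tail classes are excluded, so their debiased predictions stay untouched."""
    p = softmax(biased_logits[:, head_mask] / tau)    # teacher: biased model
    q = softmax(unbiased_logits[:, head_mask] / tau)  # student: unbiased model
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(0)
logits_b = rng.normal(size=(4, 6))                    # 4 samples, 6 predicates
head = np.array([True, True, True, False, False, False])
print(head_soft_regularizer(logits_b, logits_b, head))            # ~0: no pull
print(head_soft_regularizer(logits_b + np.arange(6), logits_b, head))  # > 0
```

Restricting the KL term to head classes is what makes the regularization "head-oriented": the biased predictor's strength on frequent predicates is transferred without re-biasing the tail.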