| Visual object detection and understanding have long been core problems in artificial intelligence and are in great demand in industry. Visual object detection answers two questions about visual objects, "where" and "what": it locates objects with bounding boxes and classifies their categories. Building on accurate detection results, visual object relationship understanding reveals the semantic relationships between visual objects. These relationships usually refer to the predicates between visual objects, which form subject-predicate-object triplets of the form <Visual Object A-Visual Relationship-Visual Object B>. This dissertation focuses on these two problems. The former is the premise of the latter, and the latter is the road to high-level visual object understanding. Along this technical route, our research contents and contributions are as follows: (1) For visual object detection, this dissertation focuses on how to accurately identify visual objects in a visual scene. To address the imbalance between foreground and background samples, we propose a Sampling-Free mechanism that handles the imbalance through optimal bias initialization and an adaptive guided loss, avoiding laborious resampling and reweighting strategies. Experiments on multiple datasets show that the Sampling-Free mechanism accelerates model training and effectively improves the accuracy of multiple visual object detection algorithms. (2) Building on accurate detection, this dissertation further studies the understanding of semantic relationships between visual objects. We address the problem of expressing visual relationships in dynamic visual scenes and propose CrossGraphAlign, a video content-oriented visual-text relationship alignment method for video content retrieval. In this method, the text is represented as a text relation graph and the video as multiple visual relation graphs, and an attention mechanism matches these graphs, which makes it possible to retrieve specific video segments using text relationships. Experiments on several datasets show that CrossGraphAlign effectively aligns visual and text relationships and greatly improves the recall of video content retrieval systems. |
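
The abstract names bias initialization as one ingredient of the Sampling-Free mechanism but does not give its formula. As a rough illustration only, the sketch below uses the widely known prior-probability initialization, in which the classification bias is set so that the initial sigmoid output equals an assumed foreground prior. This is not necessarily the dissertation's "optimal" initialization; the function names and the `foreground_prior` value are hypothetical and chosen purely for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prior_bias(foreground_prior: float) -> float:
    """Return the bias b for which sigmoid(b) equals the assumed foreground prior.

    Assumption: this is the common prior-probability initialization,
    not a formula taken from the dissertation.
    """
    if not 0.0 < foreground_prior < 1.0:
        raise ValueError("foreground_prior must lie strictly between 0 and 1")
    return math.log(foreground_prior / (1.0 - foreground_prior))


def background_loss(logit: float) -> float:
    """Binary cross-entropy of a single background sample (target 0)."""
    return -math.log(1.0 - sigmoid(logit))


# Hypothetical prior: roughly 1 in 100 candidate boxes is foreground.
foreground_prior = 0.01
bias = prior_bias(foreground_prior)

# With a zero bias every background sample starts with a large loss;
# with the prior-based bias the initial background loss is much smaller,
# so the many background samples do not dominate early training.
loss_zero_bias = background_loss(0.0)
loss_prior_bias = background_loss(bias)
print(f"bias = {bias:.4f}")
print(f"background loss with zero bias:  {loss_zero_bias:.4f}")
print(f"background loss with prior bias: {loss_prior_bias:.4f}")
```

The design point is that initialization alone changes how much the abundant background samples contribute to the initial loss, without any resampling or reweighting of individual samples.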