As a crucial task in video understanding, constructing social relationship networks not only uncovers latent semantic knowledge in video content but also helps AI better understand human behaviors and emotions in videos. However, many studies on mining interpersonal relationships still focus on static images, paying little attention to temporal knowledge and other important modalities. Although humans can easily identify or infer social relationships between characters from comprehensive cues such as appearance, interactions, dialogue, clothing style, and background, automatically capturing these relationships remains challenging for AI: it requires effectively modeling the spatiotemporal structure and semantic information of videos, integrating multiple feature cues, and building scalable social network construction models. To this end, this article studies and implements the following.

Firstly, we propose a multi-cue social network construction method based on multi-teacher knowledge distillation (McSRE) to extract social relationships from videos of unconstrained scenes. This method uses multiple teacher models and feature-based distillation to mine multiple cues from videos. We then design a scheme that combines multi-cue features with temporal features and construct an attention-based temporal cue graph (ATCG). Under this scheme, the knowledge of the multiple teacher models is transferred to several simple student models for model compression. Experiments on the ViSR and MovieGraphs datasets show that the McSRE model, despite its compressed size, achieves results close to or even surpassing state-of-the-art (SOTA) methods.

Secondly, we propose a multi-cue video social relationship network construction method based on feature aggregation (RCRV) to aggregate the meaningful contextual features that are important for identifying social relationships. We design a novel global-local VLAD (GL-VLAD) module that uses convolutions at different scales to obtain different receptive fields and extract both global and local information from video features. In addition, we propose a Multimodal Fusion Graph (MFG) that attends to knowledge from different modalities and can represent general features of multimodal video scenes.

Thirdly, building on the big data analysis platform (BDAP), we design and develop the basic functions and a video relationship generation module, giving the platform the ability to process, analyze, and visualize video data so that users can intuitively experience video social network construction.
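To make the feature-based multi-teacher distillation in McSRE more concrete, the following is a minimal PyTorch sketch, assuming one frozen teacher per cue (e.g., face, scene) and one lightweight student per cue; the module names, dimensions, and loss weighting are illustrative assumptions, not details taken from the method itself.

```python
# Sketch of feature-based multi-teacher distillation: each student matches
# its teacher's cue features (MSE) while also learning the relation labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CueStudent(nn.Module):
    """Small student network that mimics one teacher's cue features."""
    def __init__(self, in_dim=512, feat_dim=256, num_relations=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_relations)

    def forward(self, x):
        feats = self.encoder(x)
        return feats, self.classifier(feats)

def distillation_step(students, teachers, inputs, labels, alpha=0.5):
    """Sum a relation-classification loss and a feature-matching loss per cue."""
    loss = 0.0
    for cue, student in students.items():
        with torch.no_grad():                               # teachers stay frozen
            t_feats = teachers[cue](inputs[cue])
        s_feats, logits = student(inputs[cue])
        loss = loss + F.cross_entropy(logits, labels)       # task loss
        loss = loss + alpha * F.mse_loss(s_feats, t_feats)  # feature distillation
    return loss

# Toy usage: two cues, a batch of 4 clips, 8 relationship classes (all assumed).
cues = ["face", "scene"]
teachers = {c: nn.Linear(512, 256).eval() for c in cues}
students = {c: CueStudent() for c in cues}
inputs = {c: torch.randn(4, 512) for c in cues}
loss = distillation_step(students, teachers, inputs, torch.randint(0, 8, (4,)))
loss.backward()
```

Because only the compact students are kept at inference time, this pattern is what allows the compressed model to retain much of the teachers' cue knowledge.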
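The attention-based temporal cue graph can likewise be illustrated with a small sketch. Here cue features at each time step are treated as graph nodes and attention decides how much each node aggregates from its temporal neighbors; the window-based adjacency, layer wiring, and all sizes are assumptions made for illustration only.

```python
# One attention-based temporal graph layer over per-step cue features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalCueGraphLayer(nn.Module):
    def __init__(self, dim=256, window=2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.window = window                     # temporal neighborhood radius

    def forward(self, x):                        # x: (B, T, D) cue features
        B, T, D = x.shape
        scores = self.q(x) @ self.k(x).transpose(1, 2) / D ** 0.5  # (B, T, T)
        # Mask node pairs that are not temporal neighbors in the graph.
        idx = torch.arange(T, device=x.device)
        not_neighbor = (idx[None, :] - idx[:, None]).abs() > self.window
        scores = scores.masked_fill(not_neighbor, float("-inf"))
        attn = F.softmax(scores, dim=-1)         # edge weights per node
        return x + attn @ self.v(x)              # residual message passing

layer = TemporalCueGraphLayer()
out = layer(torch.randn(2, 10, 256))             # 2 videos, 10 time steps each
```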
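Finally, the global-local idea behind GL-VLAD can be sketched as parallel 1-D convolutions with different kernel sizes (hence different receptive fields over the frame sequence) followed by a NetVLAD-style aggregation layer; the cluster count, kernel sizes, and exact wiring below are assumptions, not the module's actual specification.

```python
# Multi-scale temporal convolutions + VLAD-style pooling of frame features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Soft-assignment VLAD pooling over a set of local descriptors."""
    def __init__(self, dim, num_clusters=16):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                          # x: (B, N, D)
        a = F.softmax(self.assign(x), dim=-1)      # (B, N, K) soft assignments
        residuals = x.unsqueeze(2) - self.centroids         # (B, N, K, D)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)     # (B, K, D)
        return F.normalize(vlad.flatten(1), dim=-1)

class GlobalLocalVLAD(nn.Module):
    def __init__(self, dim=256, kernel_sizes=(1, 3, 5)):
        super().__init__()
        # Small kernels capture local detail; larger ones capture global context.
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )
        self.vlad = NetVLAD(dim * len(kernel_sizes))

    def forward(self, x):                          # x: (B, T, D) frame features
        xt = x.transpose(1, 2)                     # (B, D, T) for Conv1d
        multi = torch.cat([b(xt) for b in self.branches], dim=1)
        return self.vlad(multi.transpose(1, 2))    # aggregate over time

module = GlobalLocalVLAD()
desc = module(torch.randn(2, 32, 256))             # one descriptor per video
```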