Video question answering requires models to understand both the rich information in videos and the semantics of natural language questions. Building benchmarks that systematically analyze the different capabilities of video question answering models is challenging yet crucial. Existing benchmarks often use simple, non-compositional questions and suffer from strong language biases, making it difficult to diagnose model weaknesses incisively. A recent benchmark offers a promising paradigm: it generates QA pairs automatically from pre-annotated scene graphs, which allows it to measure diverse reasoning abilities with granular control. However, its questions are limited in reasoning about fine-grained video semantics, because such information is absent from its scene graphs.

Three difficulties arise in the video domain. (1) Video scene graphs lack fine granularity. Current video scene graphs are usually labelled with only a small number of common object and relationship categories, and the attributes of the objects themselves are missing, which makes objects difficult to localize. (2) It is difficult for video question answering datasets to be both large-scale and fine-grained. Current large-scale datasets tend to have simple questions and strong language bias, while datasets with fine-grained annotation are difficult to scale up. (3) External information related to video question answering is underused. Existing datasets contain little video-related or question-related external information, which makes it difficult to use such information to improve model performance.

To address these three difficulties, this paper proposes three approaches. To address the lack of fine-grained video scene graphs, this paper proposes a method for building a fine-grained video scene graph. By designing 
a hierarchical attribute classification system and designing and executing an annotation scheme and process, rich object, attribute, relationship and action categories are annotated, and a rich and diverse fine-grained video scene graph is constructed.

To address the difficulty of making video question answering datasets both large-scale and fine-grained, a method is proposed to automatically generate a large-scale video question answering dataset for fine-grained spatio-temporal reasoning. Based on the fine-grained video scene graph above, a rich set of question templates is designed, and the templates are populated by traversing the elements of the video scene graph. The result is 1.4B unbalanced QA pairs and 13M balanced QA pairs, an order of magnitude larger than current video question answering datasets with the same number of videos.

To address the low utilization of external information in video question answering, a scene-graph-assisted approach and a reference-frame-assisted approach are proposed, which improve model performance on the textual and visual sides of the model input respectively. On the textual side, the scene-graph-assisted strategy introduces scene-graph information by counting the high-frequency words in the scene-graph vocabulary and appending them to the end of the question. On the visual side, the reference-frame-assisted strategy guides the model to understand the visual content related to the question by injecting reference frames at different stages.

To verify the effectiveness of the proposed methods, statistics are computed on the video scene graph and the video question answering dataset, confirming the fine granularity and richness of the video scene graph and the diversity and complexity of the dataset. Extensive experiments are also conducted on the video question answering 
dataset to demonstrate its quality and difficulty and to verify the effectiveness of the strategies for introducing external information. Finally, a large-scale benchmark evaluation platform for video question answering with fine-grained spatio-temporal reasoning is designed and implemented on top of this dataset.
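
As a rough illustration of two steps summarized above (populating question templates by traversing a scene graph, and appending high-frequency scene-graph words to a question), the following minimal Python sketch may help. Everything in it is an assumption made for illustration: the toy scene graph, the template wording, and the function names `generate_qa`, `frequent_words` and `augment_question` are hypothetical and do not reproduce the actual templates, balancing procedure or model input format used in this work.

```python
from collections import Counter

# Hypothetical toy scene graph for one video:
# (frame index, subject, relation, object, attributes of the object).
TOY_GRAPH = [
    (0, "person", "holding", "cup", ["red"]),
    (1, "person", "holding", "cup", ["red"]),
    (2, "person", "opening", "door", ["wooden"]),
]


def generate_qa(graph):
    """Fill illustrative templates from each triple; return unique (question, answer) pairs."""
    pairs = []
    for _frame, subj, rel, obj, attrs in graph:
        # Coarse template: uses only the subject and the relation.
        candidates = [(f"What is the {subj} {rel}?", obj)]
        # Fine-grained template: additionally uses an object attribute.
        for attr in attrs:
            candidates.append(
                (f"What is the {attr} object that the {subj} is {rel}?", obj)
            )
        for pair in candidates:
            if pair not in pairs:  # drop duplicates from repeated frames
                pairs.append(pair)
    return pairs


def frequent_words(graph, top_k):
    """Return the top_k most frequent scene-graph words (ties broken alphabetically)."""
    counts = Counter()
    for _frame, subj, rel, obj, attrs in graph:
        counts.update([subj, rel, obj, *attrs])
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _count in ranked[:top_k]]


def augment_question(question, graph, top_k=3):
    """Append the high-frequency scene-graph words to the end of the question."""
    return question + " " + " ".join(frequent_words(graph, top_k))


if __name__ == "__main__":
    for question, answer in generate_qa(TOY_GRAPH):
        print(augment_question(question, TOY_GRAPH), "->", answer)
```

In this sketch the word counts are taken over a single toy graph. Whether the statistics are computed per video or over the whole scene-graph vocabulary is not specified here, so that choice is left open.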