As the two main media for representing and storing information, images and text play a very important role in daily life and production. In recent years, with the development of computer vision and natural language processing, the demand for machine intelligence has gradually increased, especially for the ability to collaboratively understand and reason over image and text data, which has greatly stimulated research interest among researchers worldwide. As a fundamental and key task in the fields of multimodal reasoning and Turing testing, visual question answering not only effectively establishes the internal logical relationship between vision and language and promotes multimodal reasoning modeling, but also drives the development of many downstream practical applications, such as human-computer interaction and autonomous driving; it therefore has extensive and far-reaching research significance.

Visual Question Answering (VQA) is fundamentally compositional in nature, and many questions can simply be answered by decomposing them into modular sub-problems. Despite their promising performance, existing approaches still exhibit the following fundamental limitations: 1) they rely on vulnerable off-the-shelf language parsers or expert policies that are not specially designed for language-and-vision tasks, and the network layout is guided by this additional supervision rather than learned from the input data; they therefore lack adaptability to the diverse vision-semantic distributions of real-world settings and may fail. 2) Beyond that, the composable function-specific modules devised in these methods are restricted to reasoning over simple scenes from synthesized datasets with low variability, such as CLEVR, and might not adapt well to more complicated visual scenes. We argue that the key to tackling the aforementioned issues is to unify a compact network structure and powerful modules within a dynamic, data-driven setting.

To tackle this problem, we propose a Semantic-aware modular capsule Routing framework, termed SUPER, to better capture instance-specific vision-semantic characteristics and refine the discriminative representations for prediction. In particular, five powerful specialized modules as well as dynamic routers are tailored in each layer of the SUPER network, and compact routing spaces are constructed such that a variety of customizable routes can be sufficiently exploited and the vision-semantic representations can be explicitly calibrated. We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets, as well as its parameter-efficient advantage. It is worth emphasizing that this work does not pursue state-of-the-art results in VQA; instead, we expect our model to provide a novel perspective towards architecture learning and representation calibration for VQA.
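To make the idea of semantic-aware dynamic routing over specialized modules more concrete, below is a minimal sketch of one routing layer: a router conditioned on the question embedding produces soft weights over a set of candidate modules, whose outputs are then combined into an instance-specific route. All class names, module choices, and dimensions here are hypothetical illustrations under these assumptions, not the actual SUPER implementation described above.

```python
import torch
import torch.nn as nn


class DynamicRoutingLayer(nn.Module):
    """Illustrative sketch of one layer of a semantic-aware modular routing
    network: a router reads the question embedding and softly weights the
    outputs of several specialized modules (hypothetical design, not the
    paper's actual SUPER architecture)."""

    def __init__(self, dim: int, num_modules: int = 5):
        super().__init__()
        # Placeholder specialized modules; a real framework would tailor
        # function-specific modules (e.g., attention, fusion, relation, ...).
        self.candidate_modules = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_modules)]
        )
        # Router: maps the semantic (question) embedding to soft routing weights.
        self.router = nn.Linear(dim, num_modules)

    def forward(self, visual_feat: torch.Tensor, question_emb: torch.Tensor) -> torch.Tensor:
        # visual_feat: (batch, dim), question_emb: (batch, dim)
        weights = torch.softmax(self.router(question_emb), dim=-1)          # (batch, M)
        outputs = torch.stack(
            [m(visual_feat) for m in self.candidate_modules], dim=1
        )                                                                    # (batch, M, dim)
        # Weighted combination of module outputs defines one customizable route.
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)                  # (batch, dim)


if __name__ == "__main__":
    # Minimal usage example with random features.
    layer = DynamicRoutingLayer(dim=512)
    v = torch.randn(8, 512)
    q = torch.randn(8, 512)
    print(layer(v, q).shape)  # torch.Size([8, 512])
```

Stacking several such layers would yield a routing space in which the path taken through the modules varies per input, which is the data-driven alternative to fixing the layout with an external parser or expert policy.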