With the increasing popularity of the Internet and the rapid development of social networks, the information available online has evolved from pure text to multi-modal content that combines text and images. Processing this growing volume of multi-modal information quickly has become a crucial task for researchers. Among the fundamental tasks in the multi-modal domain, text-image relation inference (TIRI), which aims to identify the relationship between an image and its accompanying text, has attracted significant attention in recent years as an important part of web information processing. Currently, almost all TIRI approaches rely only on the original text and images in the annotated samples, which greatly limits their applicability because annotated TIRI data are scarce. To overcome this challenge, this thesis explores three research points from the perspective of introducing external data. Specifically, these research points are as follows.

First, this thesis proposes a TIRI method that introduces auxiliary tasks to address the problem of limited labeled samples. The proposed method introduces external relevant datasets and constructs a multi-modal multi-task joint learning framework that passes relation clues from the auxiliary task to the main TIRI task. The method first uses BERT and ResNet to extract base features from the text and the image as bi-modal sequences. The full model is then trained on the fused textual and visual information in two stages, iteratively and jointly: 1) train the auxiliary multi-modal task on its specific dataset; 2) train the main TIRI task with the original input and the prediction of the auxiliary task (a sketch of this two-stage scheme is given after the second research point below). Systematic experiments demonstrate the effectiveness of the proposed multi-modal multi-task joint learning approach, which outperforms state-of-the-art TIRI approaches.

Second, this thesis proposes a TIRI method that introduces auxiliary modality translation to enable a more comprehensive understanding of visual information, since the complexity and difficulty of understanding the image associated with a text can limit a model's grasp of the visual modality. To address this problem, the proposed method introduces a directional modality translation module, which generates textual descriptions of the original images in the dataset and uses them as supplementary textual information in the model. Specifically, the method consists of three steps: 1) use a pre-trained directional modality translation module to generate additional textual information by translating the original image; 2) capture the implicit alignment between the original inputs (text and image) and between the pair of texts (original and generated) through two layer-wise Transformer structures; 3) fuse the multi-modal hybrid representations to perform TIRI. Systematic experiments and extensive analysis demonstrate the effectiveness of the approach with auxiliary modality translation.
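To make the first research point concrete, the following is a minimal sketch, not the thesis implementation: BERT and ResNet extract bi-modal sequences, an auxiliary head and the main TIRI head share the fused representation, and the two tasks are trained in alternating stages. The label counts, the mean-pooling fusion, and the data loaders are illustrative assumptions.

```python
# Minimal sketch of multi-modal multi-task joint learning for TIRI (illustrative only).
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class BiModalEncoder(nn.Module):
    """Extracts textual (BERT) and visual (ResNet) base features as one bi-modal sequence."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])    # keep the spatial feature map
        self.img_proj = nn.Linear(2048, 768)                         # project visual dim to BERT's

    def forward(self, input_ids, attention_mask, images):
        text_seq = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        feat_map = self.cnn(images)                                  # (B, 2048, 7, 7) for 224x224 input
        img_seq = self.img_proj(feat_map.flatten(2).transpose(1, 2)) # (B, 49, 768) region sequence
        return torch.cat([text_seq, img_seq], dim=1)                 # fused bi-modal sequence

class MultiTaskTIRI(nn.Module):
    def __init__(self, num_aux_labels=3, num_tiri_labels=4):
        super().__init__()
        self.encoder = BiModalEncoder()
        self.aux_head = nn.Linear(768, num_aux_labels)
        # The main head also sees the auxiliary prediction, passing relation clues to TIRI.
        self.tiri_head = nn.Linear(768 + num_aux_labels, num_tiri_labels)

    def forward(self, input_ids, attention_mask, images):
        pooled = self.encoder(input_ids, attention_mask, images).mean(dim=1)
        aux_logits = self.aux_head(pooled)
        tiri_logits = self.tiri_head(torch.cat([pooled, aux_logits.softmax(-1)], dim=-1))
        return aux_logits, tiri_logits

def train_two_stage(model, aux_loader, tiri_loader, epochs=5, lr=2e-5):
    """Alternates the two training stages within each epoch."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for ids, mask, imgs, y_aux in aux_loader:       # Stage 1: auxiliary task on its own dataset
            aux_logits, _ = model(ids, mask, imgs)
            opt.zero_grad()
            ce(aux_logits, y_aux).backward()
            opt.step()
        for ids, mask, imgs, y_tiri in tiri_loader:     # Stage 2: main TIRI task with aux prediction
            _, tiri_logits = model(ids, mask, imgs)
            opt.zero_grad()
            ce(tiri_logits, y_tiri).backward()
            opt.step()
```

Feeding the auxiliary prediction into the main head is one simple way to pass relation clues between the two tasks; the thesis's exact interaction and fusion design may differ.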
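Similarly, for the second research point, the sketch below assumes an off-the-shelf BLIP captioner stands in for the pre-trained directional modality translation module and that image features arrive as a projected region sequence (for example, from the encoder sketched above); the layer depth, pooling, and label count are placeholders.

```python
# Minimal sketch of TIRI with auxiliary modality translation (illustrative only).
import torch
import torch.nn as nn
from PIL import Image
from transformers import (BertModel, BertTokenizer,
                          BlipProcessor, BlipForConditionalGeneration)

def translate_image_to_text(image: Image.Image) -> str:
    """Step 1: generate a textual description of the original image (models loaded per call for brevity)."""
    proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    out = blip.generate(**proc(images=image, return_tensors="pt"), max_new_tokens=30)
    return proc.decode(out[0], skip_special_tokens=True)

class TranslationAugmentedTIRI(nn.Module):
    def __init__(self, num_labels=4, d=768):
        super().__init__()
        self.tok = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")

        def make_encoder():  # a small stack of Transformer layers; depth is a placeholder
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)

        self.text_image_enc = make_encoder()   # Step 2a: align original text with image regions
        self.text_text_enc = make_encoder()    # Step 2b: align original text with generated text
        self.classifier = nn.Linear(2 * d, num_labels)

    def encode_text(self, sentences):
        batch = self.tok(sentences, padding=True, truncation=True, return_tensors="pt")
        return self.bert(**batch).last_hidden_state

    def forward(self, texts, generated_texts, image_seq):
        t = self.encode_text(texts)             # (B, Lt, d) original text
        g = self.encode_text(generated_texts)   # (B, Lg, d) image translated into text
        ti = self.text_image_enc(torch.cat([t, image_seq], dim=1)).mean(dim=1)
        tt = self.text_text_enc(torch.cat([t, g], dim=1)).mean(dim=1)
        # Step 3: fuse both hybrid representations and classify the text-image relation.
        return self.classifier(torch.cat([ti, tt], dim=-1))
```

In practice the caption would be generated once per image during preprocessing rather than at every forward pass.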
Finally, this thesis proposes a TIRI method that introduces an auxiliary language to overcome problems such as polysemy, sequential complexity, and missing words, which can limit a model's ability to understand the text modality in a monolingual setting. The method adds the Chinese translation of the original text to the dataset as an auxiliary language and builds a graph convolutional network-based TIRI model to fully learn the interaction between the original data and the Chinese translation. The method consists of three steps: 1) use a pre-trained model to obtain the corresponding feature representations of the original text, the original image, and the Chinese translation; 2) construct a multi-modal relation graph from the intra-modal structural relationships and the cross-modal alignment relationships among the original text, the original image, and the Chinese translation; 3) feed the multi-modal feature representations and the relation graph into the graph convolutional network to obtain graph-convolved high-dimensional features, which are passed to the TIRI classifier. Systematic experiments and extensive analysis demonstrate that the approach with Chinese translation significantly outperforms existing TIRI approaches as well as models adapted from other text-image classification tasks.
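To illustrate the third research point, the sketch below assumes that per-unit features (original-text tokens, image regions, and Chinese-translation tokens) have already been produced by pre-trained encoders, approximates cross-modal alignment with a cosine-similarity threshold, and uses a plain two-layer GCN; these choices are illustrative rather than the thesis's exact graph construction.

```python
# Minimal sketch of the graph-based TIRI model with an auxiliary language (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_relation_graph(text_feat, image_feat, zh_feat, sim_threshold=0.5):
    """Step 2: build a multi-modal relation graph over text, image, and translation nodes."""
    nodes = torch.cat([text_feat, image_feat, zh_feat], dim=0)     # (N, d) node features
    n_t, n_i, n_z = len(text_feat), len(image_feat), len(zh_feat)
    adj = torch.zeros(len(nodes), len(nodes))
    # Cross-modal alignment edges: connect similar nodes from different modalities.
    sim = F.cosine_similarity(nodes.unsqueeze(1), nodes.unsqueeze(0), dim=-1)
    cross = sim > sim_threshold
    for start, size in [(0, n_t), (n_t, n_i), (n_t + n_i, n_z)]:
        cross[start:start + size, start:start + size] = False      # keep only cross-modal pairs
        # Intra-modal structural edges: link neighbouring units within each modality.
        for k in range(start, start + size - 1):
            adj[k, k + 1] = adj[k + 1, k] = 1.0
    adj[cross] = 1.0
    return nodes, adj

class GCNLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, adj):
        a_hat = adj + torch.eye(len(adj))                  # add self-loops
        d_inv_sqrt = a_hat.sum(1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        return F.relu(self.lin(a_norm @ x))                # symmetric-normalised propagation

class GraphTIRI(nn.Module):
    def __init__(self, d=768, num_labels=4):
        super().__init__()
        self.gcn1, self.gcn2 = GCNLayer(d, d), GCNLayer(d, d)
        self.classifier = nn.Linear(d, num_labels)

    def forward(self, text_feat, image_feat, zh_feat):
        # Step 3: run the GCN over the relation graph and classify the pooled graph state.
        nodes, adj = build_relation_graph(text_feat, image_feat, zh_feat)
        h = self.gcn2(self.gcn1(nodes, adj), adj)
        return self.classifier(h.mean(dim=0))
```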