In recent years, with the development of Internet technology, a large amount of information has been presented to people in different forms such as video, text, and speech. How to effectively analyze and utilize this information has gradually become a hot research problem in the multimodal field. As one of the key technologies in this field, multimodal joint representation has received extensive attention. A correct and appropriate multimodal joint representation can exploit the complementarity and consistency of different modalities to obtain better feature representations and provide more accurate support for downstream tasks, playing an extremely important foundational role in multimodal classification, multimodal matching, and multimodal retrieval. Research on multimodal joint representation therefore has important academic value, since it enriches the research content of the multimodal field; it also has important application value, since making full use of the vast amounts of information in different modalities allows the intent behind user behavior to be analyzed and provides high-quality features for enterprise multimodal tasks.

This paper divides multimodal joint representation into three aspects (modal fusion, modal unity, and modal matching) according to the different angles from which modalities are joined, and carries out work on each of them:

1) Modal fusion: multimodal joint representation based on multi-layer LSTMs. This method combines the text, image, and speech modalities and is applied to downstream emotion classification tasks.

2) Modal unity: modal joint representation based on variational distillation. This method obtains a modal joint encoder that serves multiple tasks in single-modal scenarios through variational mutual information distillation.

3) Modal matching: multimodal joint representation based on contrastive distillation. This method distills a teacher model from coarse-grained and fine-grained perspectives to obtain a lightweight joint encoder for the text and image modalities, which is used for downstream multimodal matching tasks.

Specifically, the main content of this paper is divided into the following three parts:

(1) Multimodal joint representation method based on multi-layer LSTMs. To address the problem that traditional multimodal fusion methods focus on feature extraction within a single modality while ignoring the relationships between modalities, a feature-mining method that attends to both intra-modal and inter-modal information is proposed. The method designs a single-modal LSTM feature extraction layer and a multimodal LSTM feature association layer to represent the text, image, and speech information of a video from both intra-modal and inter-modal perspectives. The effectiveness of the proposed method was verified on the CMU-MOSEI multimodal sentiment analysis dataset.
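The following is a minimal PyTorch sketch of such a two-level design, not the thesis implementation: per-modality LSTMs summarize each modality (intra-modal), and a fusion LSTM runs over the stacked modality summaries (inter-modal). The class name, feature dimensions (300-dimensional text, 74-dimensional audio, 35-dimensional visual, as commonly used with CMU-MOSEI features), and seven-class output are assumptions for illustration.

```python
import torch
import torch.nn as nn


class MultiLayerLSTMFusion(nn.Module):
    """Illustrative sketch: per-modality LSTMs followed by a cross-modal LSTM.

    All dimensions and names are assumptions, not the thesis implementation.
    """

    def __init__(self, text_dim=300, audio_dim=74, visual_dim=35,
                 hidden_dim=128, num_classes=7):
        super().__init__()
        # Intra-modal feature extraction: one LSTM per modality.
        self.text_lstm = nn.LSTM(text_dim, hidden_dim, batch_first=True)
        self.audio_lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
        # Inter-modal association: an LSTM over the sequence of modality summaries.
        self.fusion_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text, audio, visual):
        # Each input: (batch, seq_len, modality_dim); the final hidden state
        # of each modality serves as its intra-modal summary.
        _, (h_t, _) = self.text_lstm(text)
        _, (h_a, _) = self.audio_lstm(audio)
        _, (h_v, _) = self.visual_lstm(visual)
        # Stack the three summaries into a length-3 "modality sequence" so the
        # fusion LSTM can model inter-modal dependencies.
        modal_seq = torch.stack([h_t[-1], h_a[-1], h_v[-1]], dim=1)
        _, (h_joint, _) = self.fusion_lstm(modal_seq)
        return self.classifier(h_joint[-1])


# Usage with random tensors shaped like CMU-MOSEI-style features.
model = MultiLayerLSTMFusion()
logits = model(torch.randn(4, 50, 300), torch.randn(4, 50, 74), torch.randn(4, 50, 35))
```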
(2) Modal joint representation method based on variational distillation. Most current multimodal pre-training models are not suited to single-modal scenarios, require large amounts of hard-to-collect aligned multimodal corpora for pre-training, and have too many parameters to be deployed in practical environments. To address these problems, we propose a modal joint encoder for single-modal scenarios built from the perspective of mutual information. The method uses variational mutual information distillation to distill text and image knowledge into a small student model, which requires no aligned corpus and greatly reduces the number of model parameters (an illustrative sketch of this objective is given after part (3)). The effectiveness of the method was verified on eight natural language processing tasks from the GLUE benchmark and on the CIFAR-10, CIFAR-100, and ImageNet-1000 tasks in the image domain.

(3) Multimodal joint representation method based on contrastive distillation. In practical scenarios, multimodal matching and retrieval models often pursue performance with such large parameter counts that they are computationally expensive and difficult to deploy. To address this, we propose a multimodal joint representation method for modal matching based on contrastive distillation. The method uses contrastive learning to compress a large-scale multimodal pre-training model into a smaller student model from both coarse-grained and fine-grained perspectives, and does not require an overly large multimodal training dataset (see the second sketch below). The validity of the proposed method was verified on the Flickr8K, Flickr30K, and MS-COCO datasets.
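For part (2), the following is an illustrative sketch of a variational mutual-information distillation term, under the assumption that a Gaussian variational distribution q(t | s) predicts frozen teacher features t from student features s; maximizing its log-likelihood lower-bounds the mutual information between the two. The class name and dimensions are hypothetical, not the thesis code.

```python
import torch
import torch.nn as nn


class VariationalMIDistillLoss(nn.Module):
    """Sketch of a variational mutual-information distillation objective.

    A Gaussian q(t | s) predicts teacher features t from student features s;
    maximizing its log-likelihood is a variational lower bound on I(t; s).
    Names and dimensions here are illustrative assumptions.
    """

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        self.mean_head = nn.Linear(student_dim, teacher_dim)
        # Per-dimension log-variance of the Gaussian q(t | s).
        self.log_var = nn.Parameter(torch.zeros(teacher_dim))

    def forward(self, student_feat, teacher_feat):
        mu = self.mean_head(student_feat)
        var = self.log_var.exp()
        # Negative Gaussian log-likelihood (up to a constant): minimizing it
        # tightens the variational lower bound on mutual information.
        nll = 0.5 * (self.log_var + (teacher_feat - mu) ** 2 / var)
        return nll.mean()


# Hypothetical usage: a frozen multimodal teacher (768-d) and a small student (256-d).
loss_fn = VariationalMIDistillLoss(student_dim=256, teacher_dim=768)
loss = loss_fn(torch.randn(8, 256), torch.randn(8, 768))
```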
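For part (3), the following sketches a coarse-grained contrastive distillation objective, assuming the student's image-text similarity distribution is pulled toward a frozen teacher's distribution with a KL term alongside a standard InfoNCE matching loss; the fine-grained (token- and region-level) counterpart mentioned above is omitted. The function name, temperature, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def contrastive_distill_loss(s_img, s_txt, t_img, t_txt, tau=0.07):
    """Coarse-grained contrastive distillation sketch (assumed formulation)."""
    # Normalize embeddings so dot products are cosine similarities.
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)

    s_logits = s_img @ s_txt.t() / tau          # student image-to-text similarities
    t_logits = t_img @ t_txt.t() / tau          # teacher image-to-text similarities

    # InfoNCE: each image should match its paired caption in the batch.
    targets = torch.arange(s_img.size(0), device=s_img.device)
    nce = F.cross_entropy(s_logits, targets)
    # Distillation: align the student's similarity distribution with the teacher's.
    kd = F.kl_div(F.log_softmax(s_logits, dim=-1),
                  F.softmax(t_logits, dim=-1), reduction="batchmean")
    return nce + kd


# Hypothetical usage: batch of 16 student (256-d) and teacher (512-d) embeddings.
loss = contrastive_distill_loss(torch.randn(16, 256), torch.randn(16, 256),
                                torch.randn(16, 512), torch.randn(16, 512))
```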