Due to the popularity of social network platforms and the development of the mobile Internet, people can upload multimedia content anytime and anywhere to comment and to express emotions. As a result, researchers can accumulate massive multimodal data of great potential research value. Analyzing such multimodal content has great practical significance, for example in network public opinion analysis and in maintaining the content security of social networks.

Although progress has been made in multimodal content analysis, most existing research does not focus on the characteristics of social network scenarios. For example, most existing methods assume that the text of a tweet is highly matched with its accompanying image, and that the model can therefore understand the text better with the help of visual information. This assumption does not always hold on social networks. Moreover, social network platforms place no restrictions on the structure of tweet content, so most tweets are loosely structured and lack prominent topics, yet existing methods do not handle such unstructured short texts carefully. In addition, the extraction, interaction, and fusion of multimodal features are challenging, so how to fully exploit multimodal information whose low-level features are heterogeneous but whose high-level semantics are similar deserves further exploration.

Based on these observations, this thesis proposes two novel multimodal content analysis algorithms tailored to the characteristics of social network content: a multimodal named entity recognition algorithm based on a Cross-Modal Auxiliary Task framework (CMAT), and Joint multimodal sentiment classification with Unsupervised Key Phrase Extraction (JUKPE). The former proposes a unified multi-task multimodal learning architecture that improves the performance of the downstream main task through two cross-modal auxiliary tasks. The first is a cross-modal matching auxiliary task, which uses the computed cross-modal similarity to re-weight the features of the different modalities at the named entity level, thereby alleviating the problem that the image and text of a tweet may match to very different degrees. The second is a cross-modal mutual information maximization auxiliary task, which enhances cross-modal shared features while filtering out modality-specific noise. The latter combines unsupervised key phrase extraction to extract the prominent topics of unstructured short tweets without additional annotation. Furthermore, a cross-modal attention mechanism extracts opinion-entity-level visual features and filters out irrelevant visual regions of the global image, and the interaction between the multimodal features is strengthened through a co-attention mechanism.
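To make the re-weighting idea concrete, the following is a minimal sketch, not the thesis's exact implementation, of how a cross-modal matching score could gate token-level fusion. The module name, the projection dimensions, and the max-over-regions / mean-over-tokens pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMatchingReweight(nn.Module):
    # Projects both modalities into a shared space, scores image-text
    # relevance, and uses that score to gate the visual evidence mixed
    # into each token-level representation (hypothetical sketch).
    def __init__(self, text_dim=768, image_dim=2048, shared_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, seq_len, text_dim)   token-level features
        # image_feats: (batch, regions, image_dim)  region-level visual features
        t = self.text_proj(text_feats)                 # (B, L, D)
        v = self.image_proj(image_feats)               # (B, R, D)

        # token-to-region cosine similarities
        sim = torch.bmm(F.normalize(t, dim=-1),
                        F.normalize(v, dim=-1).transpose(1, 2))  # (B, L, R)

        # tweet-level matching score in (0, 1): max over regions, mean over tokens
        match = torch.sigmoid(sim.max(dim=-1).values.mean(dim=-1))  # (B,)

        # per-token visual context, down-weighted when image and text barely match
        attn = F.softmax(sim, dim=-1)                  # (B, L, R)
        visual_ctx = torch.bmm(attn, v)                # (B, L, D)
        fused = t + match.view(-1, 1, 1) * visual_ctx  # re-weighted fusion
        return fused, match
```

The mutual information maximization auxiliary task could similarly be approximated with a contrastive, InfoNCE-style estimator over the two projected feature sets; this too is only one possible realization rather than the formulation adopted in the thesis.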
This thesis is organized into six chapters. The first chapter briefly introduces the research status and shortcomings of multimodal content analysis in social network scenarios and explains the motivation and content of this thesis. The second chapter describes the basic research and theory related to the main work and introduces several representative algorithms in detail. The third chapter presents multimodal named entity recognition based on the cross-modal auxiliary task framework and proposes the new multi-task multimodal learning architecture CMAT. The fourth chapter describes JUKPE, a target-oriented multimodal sentiment classification model that combines unsupervised key phrase extraction to highlight the topics of short texts. The fifth chapter builds a prototype system that integrates the two models. The sixth chapter summarizes the whole thesis and looks forward to directions for follow-up work.
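As a supplementary illustration of the JUKPE component summarized above, the sketch below shows one way an unsupervised, embedding-based key-phrase extractor could surface the topic of an unstructured tweet without extra annotation. The EmbedRank-style ranking, the hypothetical `encode` callable, and the n-gram and `top_k` settings are illustrative assumptions rather than the procedure used in the thesis.

```python
import torch.nn.functional as F

def extract_key_phrases(tweet_tokens, encode, ngram=(1, 3), top_k=3):
    # Unsupervised key-phrase ranking (assumed EmbedRank-style variant):
    # candidate n-grams are scored by cosine similarity between their
    # embedding and the embedding of the whole tweet, so no labels are
    # needed. `encode` is a hypothetical callable mapping a string to a
    # 1-D tensor, e.g. a frozen sentence encoder.
    candidates = {
        " ".join(tweet_tokens[i:i + n])
        for n in range(ngram[0], ngram[1] + 1)
        for i in range(len(tweet_tokens) - n + 1)
    }
    tweet_vec = encode(" ".join(tweet_tokens))
    scored = [
        (F.cosine_similarity(encode(c), tweet_vec, dim=0).item(), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]
```

In the JUKPE pipeline, phrases extracted in this spirit highlight the tweet's topic before the cross-modal attention and co-attention stages described in the abstract.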