| With the development of science and technology, massive amounts of multimodal data are being produced. On social networking platforms, for example, users increasingly express their opinions and emotions through multiple channels, commonly text, images, audio, and video. A central challenge of the big data era is to analyze such multimodal data by modeling and mining the information it contains. There is a substantial body of research on recognizing user depression on social networks, but most previous work relies on a single data type: text. Text alone, however, cannot support a multi-angle analysis of depression; it can only infer a person's complex emotional tendencies from the semantic relationships among texts and words. For a sample consisting of the text "I'm fine" paired with a dark, low-key picture, a model built on text features alone clearly cannot learn from such noisy data, and too much of it degrades the model's accuracy. To analyze the emotional tendencies of users of the social networking platform Sina Weibo, this thesis proposes to judge whether these users have depressive tendencies from the rich information they post, i.e., taking posts that contain both picture and text data as samples. Multimodal learning over data from different modalities allows a model to combine the features of different data types to discriminate the depressive mood of social network users. To carry out research on recognizing the depressive tendencies of social network users based on text and images, and to further realize a multimodal depression recognition model that exploits both text and image modalities, the main work and innovations of this thesis are as follows.

(1) Two datasets containing both text data and picture data were constructed with reference to related
literature and published datasets. The dataset obtained after preliminary cleaning of the original data is called dataset 1. Because the text lengths in dataset 1 are extremely right-skewed and scattered, the texts are restricted: each must be longer than 5 characters, and the lengths of the depressed and non-depressed texts must each be below their respective 95th quantile. The result is dataset 2. Each sample additionally contains a user avatar image and the images attached to the posted posts, which are used for the comparison experiments of the multimodal depression recognition model.

(2) Owing to the fuzziness and ambiguity of natural language, a single text modality is not sufficient to identify depressed users with high precision. To achieve high precision, this thesis abstracts the fusion of the two data types, text and pictures, from a deep learning perspective, treating them as text features and visual features, and reformulates the problem of recognizing depressed users on the Weibo platform as a multi-feature binary classification problem. By learning from multiple feature types, the model can judge whether a social network user is depressed by combining the characteristics of different data types.

(3) For this multi-feature binary classification problem, this thesis constructs a depressed-user recognition model, MXA, based on users' posted text data and avatar data. The idea of the model is that the raw text sequence is mapped into a semantic space through an embedding layer, and the semantic information contained in other data types then shifts the positions of words in that semantic space. Theoretical analysis and experimental results show that, compared with traditional language models, the pre-trained language model XLNet captures the semantic relationships of a text sequence more fully. Therefore, this thesis uses the
pre-trained language model XLNet to construct text features, builds 1000-dimensional image features through Xception, and then extracts image semantic features through a multi-feature adaptive gate. These private semantic features are added to the text semantic vector space, offsetting the word vectors within it. |
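The length-based filtering used to derive dataset 2 from dataset 1 in contribution (1) can be sketched as follows. The sample structure (`"text"`/`"label"` keys), the per-class quantile computation, and the helper names are illustrative assumptions, not the thesis's actual preprocessing code.

```python
# Sketch of the dataset 1 -> dataset 2 filtering described above.
# Each sample is assumed to be a dict with "text" and "label" keys,
# where label 1 = depressed and label 0 = non-depressed.
import math


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def build_dataset2(dataset1, min_len=5, q=0.95):
    # Compute a 95th-quantile length cap separately for the depressed
    # and non-depressed texts, as described in the thesis.
    caps = {}
    for label in (0, 1):
        lengths = [len(s["text"]) for s in dataset1 if s["label"] == label]
        caps[label] = quantile(lengths, q)
    # Keep texts longer than min_len characters and shorter than the
    # per-class quantile cap.
    return [
        s for s in dataset1
        if len(s["text"]) > min_len and len(s["text"]) < caps[s["label"]]
    ]
```

This mirrors the two constraints stated above: a lower bound of 5 characters to drop near-empty posts, and a per-class upper bound to trim the long right tail of the length distribution.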
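The word-vector offset idea in contribution (3) can be illustrated with a minimal NumPy sketch of a multi-feature adaptive gate: a gate computed from the text and image features decides how much of the projected image semantics is added to each word vector, shifting it in the text semantic space. All shapes, weight matrices, and the random initialization are illustrative assumptions; the sketch stands in for pre-extracted XLNet text features and 1000-dimensional Xception image features and is not the exact MXA architecture.

```python
# Hypothetical sketch of gated multimodal fusion in the spirit of MXA.
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def adaptive_gate_fusion(text_emb, img_feat, rng=None):
    """
    text_emb: (seq_len, d) word vectors, e.g. from XLNet.
    img_feat: (1000,) image feature vector, e.g. from Xception.
    Returns word vectors of the same shape, shifted by gated image semantics.
    """
    rng = rng or np.random.default_rng(0)
    seq_len, d = text_emb.shape
    # Project the 1000-dim image feature into the d-dim text semantic space.
    w_proj = rng.standard_normal((img_feat.shape[0], d)) * 0.01
    img_sem = img_feat @ w_proj                                # (d,)
    # Per-word gate: how strongly the image should shift this word.
    w_gate = rng.standard_normal((2 * d, 1)) * 0.01
    gate_in = np.concatenate(
        [text_emb, np.tile(img_sem, (seq_len, 1))], axis=1)   # (seq_len, 2d)
    gate = sigmoid(gate_in @ w_gate)                           # (seq_len, 1)
    # Offset each word vector by the gated image semantics.
    return text_emb + gate * img_sem
```

In a trained model the projection and gate weights would be learned jointly with the classifier; here they are random placeholders whose only purpose is to show how the image modality "changes the position of words in the semantic space".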