Research On Community Discovery Based On Text Attribute Information

Posted on: 2024-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Wu
Full Text: PDF
GTID: 2530307073959799
Subject: Applied Statistics
Abstract/Summary:
As society has developed, contact and communication between people are no longer limited to real life; interactions in the online virtual world carry even more information. Social network research emerged in response. Community discovery, one branch of this research, partitions the nodes of a network into small communities with different functions. Analyzing these communities clarifies the composition of the network's nodes, and in practical applications it also makes it possible to provide customized services for the nodes in a community. Community discovery is also the basis of many other kinds of network research, so obtaining an accurate community discovery result is particularly important.

Social networks can be divided mainly into character relationship networks and citation networks, which represent interactions between people and citation relationships between papers, respectively. Individual people or papers are treated as network nodes, and the connections between them as network edges. The data include structural data, which describe the connections, and attribute data, which describe the characteristics of the nodes. Most earlier studies performed community discovery on explicit networks with only structural data or on implicit networks with only attribute data, and neither captures the complete information in the network. Later work combined the two and proposed fusion-attribute community discovery algorithms. However, the attributes used in most of these studies are too simple, consisting only of "0-1" variables, which is inconsistent with real data, and node attribute similarity is mostly measured with unsupervised methods, which are relatively simple.

This paper therefore selects the "Mailgate" data, which contain real text, and the citation network data of the four major international statistical journals as representatives of the character relationship network and the citation network. On this basis, the paper introduces the Jaccard similarity, designs and proposes two methods, "BERT model + Jaccard similarity" and "Bi-LSTM model integrating Attention mechanism + cosine similarity", and explores attribute similarity in two modes: unsupervised and "supervised + unsupervised". A text attribute fusion community discovery algorithm is proposed, which combines the text similarity results with the explicit network and explores community discovery based on classical methods and on graph embedding methods. Real networks are used for empirical analysis to select the optimal fusion-attribute community discovery algorithm, extract community topics, and explore the characteristics of character relationship networks and citation networks.

The findings are as follows. Among the attribute similarity methods, the similarity results produced by "Bi-LSTM model integrating Attention mechanism + cosine similarity" are the most differentiated. On both data sets, the Louvain algorithm based on "Bi-LSTM model integrating Attention mechanism + cosine similarity" is the optimal fusion-attribute community discovery algorithm. In terms of network characteristics, the character relationship network has obvious core nodes, around which secondary nodes form in turn, producing a layered management organization that serves a person or an event. The citation network is sparser, and the core nodes of its communities are key works in the field. Studying the citation network can help trace how research in the literature has developed.

The main contents of this paper are as follows.

First, text similarity is calculated with three methods: "Jaccard similarity", "BERT model + Jaccard similarity", and "Bi-LSTM model integrating Attention mechanism + cosine similarity". The "BERT model + Jaccard similarity" method is unsupervised, and the "Bi-LSTM model integrating Attention mechanism + cosine similarity" method is "supervised + unsupervised". The text similarity results from the three methods are each fused with the network structure, and the text similarity method best suited to community discovery is selected by comparison.

Second, a community discovery method based on text attribute information is proposed. Fusion-attribute community discovery is explored with algorithms built on classical community discovery algorithms and on graph embedding methods. For the classical algorithms, the Louvain and Infomap algorithms are selected; for graph embedding, GraRep, DeepWalk, and SDNE are selected. Each is fused with the text similarity calculation. Traditional community discovery algorithms serve as baselines: the GN algorithm is applied to explicit networks that use only structural data, and the K-means algorithm is applied to implicit networks that use only attribute data. The optimal fusion-attribute community discovery algorithm is then identified through empirical analysis on different types of real networks.

Third, the "Mailgate" data and the citation network data of the four major international statistical journals are used as representatives of the character relationship network and the citation network for empirical research. The data required for this paper are selected and cleaned, the node relationships are sorted out, and the network characteristics are explored in a preliminary way. Similarity is then calculated to complete attribute fusion, and community discovery is carried out. Finally, the LDA model is used to explore the topic of each community and the relationships between communities, and the characteristics of the character relationship network and the citation network are summarized.
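
To make the similarity and fusion steps described in the abstract more concrete, here is a minimal Python sketch. It is not the thesis's implementation. The whitespace tokenization, the function names, and in particular the fusion rule (a convex combination of a unit structural weight and the text similarity, controlled by a hypothetical parameter `alpha`) are all assumptions made for illustration; the abstract does not state how the fusion is actually performed.

```python
import math


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two texts."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty texts
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(vec_a, vec_b) -> float:
    """Cosine similarity between two equal-length numeric vectors,
    e.g. sentence embeddings produced by some text encoder."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_edge_weights(edges, texts, alpha=0.5):
    """Attach a fused weight to every existing edge (u, v).

    ASSUMPTION: weight = alpha * 1 + (1 - alpha) * text_similarity(u, v),
    where 1 is the structural weight of an existing edge. This rule and
    the parameter `alpha` are hypothetical, not taken from the thesis.
    """
    return {
        (u, v): alpha + (1 - alpha) * jaccard_similarity(texts[u], texts[v])
        for u, v in edges
    }


if __name__ == "__main__":
    texts = {
        "n1": "budget meeting schedule",
        "n2": "meeting schedule update",
        "n3": "travel plans",
    }
    edges = [("n1", "n2"), ("n2", "n3")]
    for edge, weight in fuse_edge_weights(edges, texts).items():
        print(edge, round(weight, 3))
```

The resulting weighted edge list could then be passed to any community detection routine that accepts edge weights, such as a Louvain implementation, so that textually similar neighbors are more likely to be grouped together.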
Keywords/Search Tags: community discovery, clustering methods, attribute similarity, Louvain algorithm, LDA model