Research On Community Detection And Classification In Networked Data Using Probabilistic Generative Models  Posted on:20140810  Degree:Doctor  Type:Dissertation  Country:China  Candidate:Z W Wang  Full Text:PDF  GTID:1108330479479633  Subject:Management Science and Engineering  Abstract/Summary:  PDF Full Text Request  Many systems in the real world consist of interconnected entities. Networks provide a formal way to represent such systems. Driven by the development of information technology, people are able to observe more systems in the real world and store more observation data. As a result, various networked data appears. For example, the World Wide Web, online social networks, journal citation networks, collaboration networks between scientists, and so on. In order to analyze networked data, it is urgently required to study the data mining technologies on networked data. Community dection and classification in networked data, which aims to cluster and classify the nodes using the edges in networks, are both the important technologies of the data mining on networked data. They play the important roles in the study of network structure, entity clustering and classification as well as the storage of networked data.In resent years, the problems of community dection and classification in networked data have been received wide attention and a lot of methods are developed. However, the present methods cannot provide reasonable results in all types of networks. For example, the methods of optimizing modularity may discover meaningful communities from the networks which have no community structure; classification methods based on the homophily assumption have poor performance on the networks with low homophily. Probabilistic generative model can model data based on some assumption about the process of generating data. Statistic inference methods are exploited to fit probabilistic generative models to the actual data and the latent characters behind the data can be inferred. The methods based on probabilistic generative models only depend on model assumptions and actual data. Different model assumptions can be used to build models for different types of data. In order to improve the performance of community detection and classification in networked data, the principle of probabilistic generative models is utilized to study the problems of community dection and classification in networked data. The main contributions of this paper are as follow:(1) A method based on the node community model is proposed. Based on the idea of node community that communities comsist of nodes and one node belongs to one community, the nonparametric method is exploited to build the probabilistic generative model for networks. The main idea of this model is that the probability with which node A is connected to another node B is equal to that with which the community of the node A is connected to the node B. In this model, the communities of all nodes are first generated using the the nonparametric method, and then all edges are generated based on the communities of nodes. After the latent variants in the model are calculated using Gibbs sampling, the communities of all nodes can be determined. Because of the usage of the nonparametric method, the number of communities can be determined during the process of calculating the latent variants. The experiments on synthetic and realworld networks show that this method is an effective method for detecting nonoverlapping communities.(2) A method based on edge community model is proposed to address the problem of detection overlapping communities. Based on the idea of edge community that communities consist of edges, a probabilistic generative model about the network and its edge communities is built. The model parameters are calculated using the nonparametric method and the probabilities with which a node belong to communties are calculated using the model parameters. Because of the idea of edge community, a node can belong to more than one community. The number of communities can be determined during the process of calculating the latent variants using the nonparametric method. The experiments on synthetic and realworld networks show that this method can extract the overlapping communities from networks. Besides, this method can also provide the participation degree in the communities of which the node is a member.(3) A method based on a probabilistic generative model is proposed for classification in networked data. According to the connecting character in the networks with low homophily, the class propagation distribution is proposed to describe the connecting probability of two nodes. Based on the class propagation distribution of nodes, a probabilistic generative model for networks is built. In this model, the edges of networks are observed variables and the classes of the nodes whose classes are unknown are latent variables. The values of the latent variables can be calculated by fitting the generative model to the network. Experimental results on the real datasets show that the proposed method can provide better performance than the previous methods in the networks with low homophily.(4) A network visualization tool called ADraw is developed to visualize node communities and classes while drawing networks. In ADraw, A multiphase clustering layout algorithm is developed to calculate the placement of nodes based on node attibutions. This algorithm does not only make the nodes with same attribute values closer, but also places the multivalued nodes between the corresponding singlevalued nodes. ADraw represents the multivalued nodes using the pie charts in which the sectors with different colors represent their different attribute values. A attributebased coloring algorithm is developed to calculate the positions of the the sectors. The coloring algorithm can make the sectors closer to the nodes with the same colors.  Keywords/Search Tags:  data mining, networked data, community detection, classification in networked data, probabilistic generative model, nonparametric method  PDF Full Text Request  Related items 
 
