Font Size: a A A

Research On Algorithms For Correlation Clustering

Posted on:2020-08-12Degree:DoctorType:Dissertation
Country:ChinaCandidate:J L HuaFull Text:PDF
GTID:1368330614972189Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Clustering is a very important unsupervised learning method in machine learning.In general,clustering algorithms assign similar samples to the same cluster according to certain criteria,and assign dissimilar samples to different clusters.The most common input to clustering algorithms is the similarity(dissimilarity)matrix,and the elements in the matrix represent the similarity(dissimilarity)between the corresponding samples.Correlation clustering is a special clustering technology whose input is a signed network that simultaneously represents similarity and dissimilarity between samples.The edges that represent similarities are commonly referred to as positive edges and the edges that represent dissimilarities are referred to as negative edges in a signed network.The goal of correlation clustering is that the positive edges in the clustering results only exist within the cluters,and the negative edges only exist between clusters.The traditional correlation clustering algorithms are mainly based on the Integer Linear Programming(ILP)model.The main problem with this type of algorithms are that the algorithms are computationally inefficient and often fails to deal with the increasingly common large-scale social networks and biological networks.This thesis study the correlation clustering algorithms for large-scale signed networks.The study in this thesis plans two technical routes to design correlation clustering algorithms.The first technical route is based on scale reduction for signed networks and ILP model for correlation clustering based on circular inequalities.We designed scale reduction algorithms for unweighted and weighted singed networks,respectively,so that the clustering algorithm can run on smaller scale signed networks.The ILP model based on the circular inequality has higher computational efficiency than the original ILP model.It is applied to the reduced signed network to generate two correlation clustering algorithms.The second technical route seeks a breakthrough of correlation clustering from the segmentation algorithms for unsigned networks.There are already many research results and literature materials for unsigned network segmentation,we draw some of the important ideas to design correlation clustering algorithm.Specifically,the main research works of this thesis are described as follows:(1)Signed network size reduction algorithms based on network structure characteristics.Signed networks such as social networks and biological networks in practical applications follow some typical network structure characteristics,such as power-law distribution of vertices,high clustering coefficient,mutuality,transitive and so on.This thesis analyzes the impact of these network structure characteristics on clustering propensity,and accordingly designs scale reduction algorithms suitable for different types of signed networks: a star structure is proposed for the unweighted signed networks.To reduce the size of network,the star structure searches for vertices within a step range away by the importance of vertices to clustering in the signed network.Then the star are taken as cluster prototypes;A network reduction algorithm based on the principle of Drop of Water is proposed for the weighted signed network.The algorithm searches from important vertices on all search paths.The points belong to a path form a cluster prototype.Regardless of the type of singed network,the cluster prototypes are ultimately designed to be distributed throughout the network for maximum reduction.(2)Local search based on improved ILP model.The constraints in the ILP model of traditional correlation clustering algorithms are constructed based on the triangular inequalities of the fully connected networks.So with the increasing of size of signed network,it will lead to a sharp increase in the number of constraints,which is primary cause of the low efficiency of the traditional correlation clustering algorithms.In this thesis,an improved ILP model is proposed.The constraints are based on the actual circles existing in the signed network,and the circle inequality constraint is used instead of the triangle inequality constraint.Therefore the number of constraints is determined by the number of circles actually present.Since the number of circles in the signed network is much lower than the number of triangles in the fully connected graph,the efficiency of the improved ILP model is much better than the original model.Using the ILP model based on circular inequality,this thesis designs a local search strategy for the reduced signed network to get the final correlation clustering results.(3)Correlation clustering algorithm based on random walk.Random Walk is widely used in unsigned network segmentation because of its intuitiveness,simplicity,and efficiency.This thesis applies the random walk mechanism to the signed network and designs Random Walk Gap(RWG)mechanism to capture the cluster structure information of the signed network.RWG first constructs two derived networks from the original singed network and performs random walk in the two derived networks.Thereby it obtains the multi-step transition probability of the derivative networks.It is found that the edges with different clustering meanings in the singed network have different rules for the changes of multi-step transition probability values in the two derived networks,so three types of edges are defined.The RWG mechanism can find edges that have significant impact on clustering and correct the weights of these edges.Then a very simple greedy clustering algorithm can be used to correlation clustering and achieve very good clustering results.Additional work in this thesis includes designing artificial signed network generation algorithms.One of the reasons for the slow progress in the research of correlation clustering is the lack of experimental data,so this thesis designed the artificial signed network generation algorithms.The algorithms can generate artificial signed networks with various characteristics,such as Pow-law distribution of positive and negative vertices,high clustering coefficients,and so on.Generated signed networks make the comparison of algorithm performance more complete.
Keywords/Search Tags:Signed networks, Complex networks, Correlation clustering, Cluster structure, Community detection, Random walk
PDF Full Text Request
Related items