Font Size: a A A

Research Of Active Semi-supervised Clustering And Its Application In Community Detection

Posted on:2015-11-26Degree:DoctorType:Dissertation
Country:ChinaCandidate:M W LengFull Text:PDF
GTID:1228330428498959Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Semi-supervised clustering based on labeled data improves the performance of clustering by using the labeled data to guide the procedure of clustering. If the number of labeled data is too small and the labeled data are obtained by sampling randomly from the origin data set, then these labeled data may not be distributed into all clusters of the whole dataset. That is to say, there will exist some clusters which do not contain any labeled data, the existing semi-supervised clustering algorithm will assign the data in these clusters to the other clusters forcibly, and the problem is more serious on multi-density and imbalanced datasets. In this thesis, we carry out the research on the semi-supervised clustering based on small size of labeled data, and carry out the research about active semi-supervised clustering algorithm on multi-density and imbalanced datasets at the same time, and then apply the active semi-supervised clustering algorithms to solve the problem of community structure detection in complex networks. Community structure is one of the most important characters of complex networks, and it is important to discover the community structures of complex networks. Today, community detection has been an important topic in several scientific fields, such as statistical physics, computer science, bioinformatics, etc. Active semi-supervised clustering and its application are studied in this thesis, and the main achievements are as follows:(1) A semi-supervised K-means clustering algorithm based on small size of labeled dataset is proposed, which uses labeled data to calculate the threshold for the similarity between data in the same cluster. If the minimum distance between unlabeled data and the centers of clusters is larger than the threshold, a new cluster center can increase, and repeat the process until the minimum distance between any unlabeled data and the center of every cluster is no larger than the threshold. Since the threshold of similarity between data is obtained using the labeled data, which leads to the result that the number of cluster centers is larger than the real number of the cluster centers, and the final number of clusters is larger than the real number of clusters. We propose a concept of influence factor to measure the similarities between clusters, and the corresponding clusters are merged based on the influence factor between clusters, and we obtain a semi-supervised K-means clustering algorithm with high accuracy. (2) A semi-supervised clustering algorithm based on label propagation and small size of labeled data is proposed, which expands the labeled dataset by propagating the labels of labeled data to their k nearest neighbors. If the distance between the rest of core unlabeled data and the existing clusters is larger than a threshold after the expanding process is finished, then a new cluster is constructed, and it is expanded using the same method. We detect whether there exists a new cluster after the expanding process is finished, and repeat this process until the distances between the rest of core unlabeled data and the existing clusters are all less than a threshold.(3) An active semi-supervised clustering algorithm for multi-density and imbalanced datasets is proposed. First, it uses minimum spanning tree clustering to partition the dataset into some clusters, and from each cluster, the data point with maximum density is selected and labeled by a noiseless oracle; next, the clusters are merged into one if labels of the labeled data in them are same; then, it expands the labels of label data in the cluster based on the density of the cluster, and finally, it deals with the rest of unlabeled data after the expanding process is finished using k nearest neighbor classification criteria.(4) An active semi-supervised community detection algorithm based on label propagation is proposed. Firstly, the proposed method calculates the density of each node based on the common neighbors of nodes, and finds out all core nodes. Secondly, the proposed method actively selects a few core nodes based on the weighted shortest path method. The proposed algorithm guarantees that the selected core nodes can cover as many communities as possible in a network. These selected nodes are labeled by domain experts, and are to be viewed as the initial set of labeled nodes in the process of community detection. Thirdly, the proposed method expands the labeled node set with label propagation. The propagating process labels neighbors of the labeled nodes according to similarity threshold which is obtained automatically based on the characters of networks. Fourthly, the rest of unlabeled nodes are assigned with the label occurs most frequent in their neighbors.(5) A fast active semi-supervised community detection method based on asymmetric similarity is proposed, which improves the efficiency of community detection algorithm by incorporating the community detection into the process of nodes selection partially, and the main processes are given as follow. Firstly, an asymmetric similarity metric is proposed to measure the similarities between nodes. Secondly, the proposed method calculates the density of each node and translates an unweighted network into a weighted directed network based on the proposed asymmetric similarity measure, and it finds out all core nodes. Thirdly, the proposed method selects nodes from the core node set based on densities of nodes, and incorporates the community detection into the process of nodes selection. Fourthly, the proposed method labels the selected nodes by accessing an oracle and merges communities with the same community label, and it deals with the nodes in the area of overlapping communities and the rest of unlabeled nodes according to the similarity between node and community.
Keywords/Search Tags:semi-supervised clustering, k nearest neighbors, active learning, communitydetection, label propagation
PDF Full Text Request
Related items