
A Study on Semi-supervised Learning Based on Genetic Algorithm

Posted on: 2014-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: J J Sheng
Full Text: PDF
GTID: 2308330461972572
Subject: Computer application technology
Abstract/Summary:
Clustering is a commonly used method in data mining. It partitions data objects into clusters according to the degree of similarity between them. Traditional clustering is regarded as unsupervised machine learning because it does not use any supervised information about the data. Semi-supervised clustering introduces such supervised information into the clustering process and uses it to guide the otherwise unsupervised learning. The supervised information takes two forms: class labels and pairwise constraints (must-link constraints and cannot-link constraints). How to use supervised information effectively to aid unsupervised learning is an important open issue in data mining.

Many semi-supervised clustering methods already exist. This thesis studies semi-supervised clustering from the perspective of the genetic algorithm, along two lines. First, it studies a new way of introducing pairwise constraints into semi-supervised clustering. Second, it studies how to improve the existing semi-supervised clustering algorithm based on the genetic algorithm. The main contributions and innovations are:

1. This thesis presents a new concept named distance degree. Each sample has a distance degree. If a sample's distance degree is high, the samples near it are sparsely distributed and the distances between them are long. Conversely, if a sample's distance degree is low, the samples near it are densely distributed and the distances between them are short.

2. This thesis presents a new method of using constraint information in semi-supervised clustering. Agglomerative hierarchical clustering (AHC) is a bottom-up hierarchical clustering method that merges atomic clusters step by step. AHC does not use supervised information, so it is an unsupervised clustering process. This thesis introduces pairwise constraints into AHC, combines them with the samples' distance degree, and presents a semi-supervised agglomerative hierarchical clustering algorithm based on pairwise constraints (PS-AHC). The algorithm uses the pairwise constraints to update the distances between clusters so that they better reflect the true structure of the data, which in turn changes the clustering result. Experimental results confirm that PS-AHC effectively improves clustering performance.

3. The genetic algorithm is an adaptive, probabilistic, global optimization search algorithm, and it is a general-purpose method for solving search problems. The existing semi-supervised clustering algorithm based on the genetic algorithm (LG-SSC) uses only class labels and ignores pairwise constraints. This thesis presents an improved semi-supervised clustering algorithm based on the genetic algorithm (PLG-SSC) that uses class labels and pairwise constraints at the same time, and so makes fuller use of the available supervised information. Within PLG-SSC, this thesis presents a sample assignment method named PFDS, which effectively reduces the number of violated pairwise constraints. Experimental results confirm that PLG-SSC further improves clustering accuracy.
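
As a rough illustration of the ideas in contributions 1 to 3, the following Python sketch shows one way pairwise constraints can steer agglomerative clustering. It is not the PS-AHC or PFDS procedure from the thesis, whose exact update rules are not given in the abstract. The function names are hypothetical, and several choices are assumptions made for this sketch: distance_degree is taken to be the mean distance to the k nearest neighbours, must-link pairs get distance 0, cannot-link pairs get infinite distance, and clusters are merged by average linkage.

```python
import math
from itertools import combinations


def distance_degree(points, k=2):
    """ASSUMED definition: mean distance from each sample to its k nearest
    neighbours. A high value means a sparse neighbourhood."""
    degrees = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        degrees.append(sum(dists[:k]) / k)
    return degrees


def count_violations(labels, must_link, cannot_link):
    """Number of pairwise constraints that a labelling breaks."""
    broken_ml = sum(1 for i, j in must_link if labels[i] != labels[j])
    broken_cl = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return broken_ml + broken_cl


def constrained_ahc(points, n_clusters, must_link=(), cannot_link=()):
    """Average-linkage AHC on a constraint-adjusted distance matrix
    (an illustrative stand-in, not the thesis's PS-AHC)."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    for i, j in must_link:        # assumption: must-link pairs are "touching"
        d[i][j] = d[j][i] = 0.0
    for i, j in cannot_link:      # assumption: cannot-link pairs never merge
        d[i][j] = d[j][i] = math.inf

    clusters = [[i] for i in range(n)]   # start from singleton clusters

    def linkage(a, b):
        # Average pairwise distance; infinite if any cannot-link pair is inside.
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            dist = linkage(clusters[a], clusters[b])
            if best is None or dist < best[0]:
                best = (dist, a, b)
        if best[0] == math.inf:
            break                 # only forbidden merges remain; stop early
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * n
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
ml, cl = [(0, 2)], [(0, 1)]
plain = constrained_ahc(points, 2)
guided = constrained_ahc(points, 2, ml, cl)
print(count_violations(plain, ml, cl), count_violations(guided, ml, cl))
```

On this toy data the unconstrained run groups the samples purely by geometry and breaks both constraints, while the constraint-adjusted run satisfies them. A violation count of this kind is also the sort of quantity a genetic algorithm could penalise in its fitness function, although the abstract does not state how PLG-SSC does so.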
Keywords/Search Tags: clustering, class label, pairwise constraints, genetic algorithm, semi-supervised clustering