
Research On Semi-supervised Clustering Algorithm With The Priori Knowledge

Posted on: 2013-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: M Liu
GTID: 2248330371996848
Subject: Systems analysis and integration
Abstract/Summary:
Clustering is a widely used method in data mining. In both real-world applications and academic research, it is commonly used for data analysis and data categorization. As an important branch of clustering, semi-supervised clustering has been an active research field over the past decade. In semi-supervised clustering, a small amount of a priori knowledge is used to guide the search for a clustering result, improving clustering accuracy or efficiency. This prior knowledge takes various forms, including data label information and pairwise data category information; we refer to data carrying these two kinds of information as labeled data and pairwise constrained data, respectively. Because prior knowledge can be distributed differently, its effect varies: it may even have a negative impact on the clustering and reduce accuracy. In this paper, we first evaluate the importance of the prior knowledge. Based on the importance criterion, we then process the prior knowledge so that it can be used more effectively to improve semi-supervised clustering accuracy. For the different kinds of prior knowledge, we carried out the following research.
(1) This paper analyzes the negative effects of labeled data in the clustering process and presents a semi-supervised clustering algorithm that uses labeled data. An importance evaluation criterion for labeled data is presented. Labeled data are then used in two dedicated ways to assist initialization and the clustering process: the LDP initialization method and the double adjustable strategy, respectively. The LDP method uses a modified similarity criterion and a label propagation strategy to optimize the initial cluster seeds. During clustering, the double adjustable strategy relies on interactive selection between labeled and unlabeled data to improve the accuracy of label propagation. During the search for a clustering solution, the intensity of the labeled data is adjusted dynamically.
(2) By analyzing the effects of different pairwise constrained data in the clustering process, this paper presents a semi-supervised clustering algorithm that uses pairwise constrained data. First, the concept of a clique is introduced: the algorithm integrates the scattered pairwise constrained data to form cliques. The modified clustering algorithm then evaluates the importance of each clique. Based on the importance of the different cliques, it applies specific methods to improve the effect of the constraints, namely a potential function and a feedback function. In the potential function, a modified penalty for constraint violations and a constraint propagation method are used to strengthen the impact of important constraints in the clustering process, improving the search for a clustering result. In the feedback function, the modified semi-supervised clustering algorithm calculates the damage of each clique in order to merge or split clusters, again improving the search for a clustering solution.
We validate the modified semi-supervised clustering algorithms experimentally on multiple data sets and analyze the results. The experiments show that semi-supervised clustering that accounts for the importance of prior knowledge can reduce the negative effects of prior knowledge on the algorithm, optimize the search for a clustering solution, and improve the accuracy and stability of the clustering algorithm.
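The abstract does not give the exact definition of a clique. As a rough illustration only, the sketch below assumes that a clique is a group of points connected, directly or transitively, by must-link constraints, and groups them with a union-find structure. The function name `build_cliques` and the sample constraints are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def build_cliques(n_points, must_links):
    """Group point indices joined by must-link pairs (assumed clique definition)."""
    parent = list(range(n_points))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the two groups named by each must-link pair.
    for a, b in must_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of every group by their root.
    groups = defaultdict(list)
    for i in range(n_points):
        groups[find(i)].append(i)

    # Points without any constraint are not reported as cliques.
    return sorted(g for g in groups.values() if len(g) > 1)


# Hypothetical constraints: 0-1 and 1-2 chain into one clique, 5-6 form another.
cliques = build_cliques(8, [(0, 1), (1, 2), (5, 6)])
print(cliques)
```

Under this assumption, scattered pairwise constraints collapse into a small number of groups, which is the unit an importance score or a violation penalty could then be attached to.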
Keywords/Search Tags: Semi-supervised Clustering, Labeled Data, Pairwise Constrained Data, Label Propagation, Constraint Violation Penalty