
A Study On Semi-supervised Clustering Algorithm Based On Domain Knowledge

Posted on: 2010-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: H C Huang
Full Text: PDF
GTID: 2178360278480500
Subject: Computer application technology
Abstract/Summary:
Clustering analysis is one of the basic tasks in data mining: the data objects are partitioned into clusters based on a certain measure of similarity. From the perspective of machine learning, clustering analysis can be seen as an unsupervised learning process, and unsupervised learning generally does not require class label information before analysis. In the real world, however, people often have some knowledge about the data to be analyzed, and traditional clustering methods generally ignore this useful information.

Semi-supervised clustering was proposed to utilize such knowledge to guide the clustering process and improve the clustering result. It has been shown that semi-supervised clustering can achieve better clustering results, and it has recently become one of the hot research topics in the area of clustering.

This thesis studies semi-supervised clustering algorithms and their application from the perspective of constraints, attributes, rules, and real-world use. Its main contributions and innovations are:

1) It analyses the COP-KMeans clustering algorithm and points out its disadvantages, introduces a dispatching method based on the constraint set together with the concept of an assistant centroid, proposes an improved version called MLC-KMeans, and confirms its efficiency with experiments on several UCI data sets.

2) It tries to discover the relation not only between attributes and class labels, but also between attributes and constraints. On the one hand, it adopts attribute reduction methods: it decreases the number of attributes by analysing the labeled data objects and then performs clustering on the new attribute set. On the other hand, it finds new constraints by restricting the attribute scope of existing constraints and then uses them to direct clustering. Both methods achieve good results.

3) It uses association rules to discover the relation between attribute subsets and class labels by analysing the partially labeled data in the data set, and adds these rules to the clustering process as prior domain knowledge to improve the clustering result. The semi-supervised clustering method based on association rules makes good use of the rule information and demonstrates how prior knowledge derived by data mining methods, and constraint relations among attribute subsets, can be applied in semi-supervised clustering.

4) Finally, the semi-supervised clustering technique proposed in this thesis is applied to a real-world problem, the clustering of web users. The thesis gives a detailed description of the process, from extracting the web log to the clustering analysis.
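For readers unfamiliar with the baseline named in the first contribution, the following is a minimal sketch of COP-KMeans as it is commonly described in the literature (k-means in which each point is assigned to the nearest centroid that breaks no must-link or cannot-link constraint). It is not the MLC-KMeans variant proposed in the thesis, whose dispatching method and assistant centroids are not specified in this abstract. The names `violates` and `cop_kmeans`, the parameters, and the random initialisation are illustrative assumptions.

```python
import math
import random


def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if assigning point i to `cluster` breaks a constraint.

    `labels[j]` is None while point j is still unassigned in this pass.
    """
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            # A must-link partner already sits in a different cluster.
            if labels[other] is not None and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            # A cannot-link partner already sits in this cluster.
            if labels[other] == cluster:
                return True
    return False


def cop_kmeans(points, k, must_link=(), cannot_link=(), n_iter=20, seed=0):
    """Illustrative COP-KMeans: constrained assignment plus centroid update."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest centroid.
            order = sorted(range(k), key=lambda c: math.dist(p, centers[c]))
            for c in order:
                if not violates(i, c, new_labels, must_link, cannot_link):
                    new_labels[i] = c
                    break
            else:
                # No cluster satisfies the constraints for this point.
                raise ValueError(f"no feasible cluster for point {i}")
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centers
```

The sketch fails with an error as soon as a point has no feasible cluster, so its outcome can depend on the order in which points are visited. This is a frequently noted weakness of the baseline; whether it is among the disadvantages the thesis addresses is not stated in the abstract.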
Keywords/Search Tags: Data mining, Semi-supervised clustering, Domain knowledge, MLC-KMeans, Attribute reduction