Research on Multi-label Clustering Algorithms and Their Evaluation

Posted on: 2014-01-27 | Degree: Master | Type: Thesis
Country: China | Candidate: S Y Cheng | Full Text: PDF
GTID: 2268330401982050 | Subject: Computer software and theory
Abstract/Summary:
Clustering is the process of grouping similar data into clusters, and it is one of the key technologies in data mining. Considerable research effort has focused on single-label clustering, while multi-label problems have received little attention. In the real world, however, a data point may belong to multiple clusters. For instance, a document may belong to both politics and economy, and a piece of music may be both happy and relaxing. Traditional clustering methods do not take this into account. Multi-label clustering differs significantly from traditional single-label clustering: the correlation between categories and the co-occurrence of labels mean that single-label clustering methods cannot be applied to multi-label problems directly. Existing multi-label clustering algorithms are few, and they target only text mining. They rely heavily on prior information, a large number of parameters, and threshold settings, so they lack generality.

Based on this analysis, we study the correlation and co-occurrence between clusters and construct a model of the area occupied by the sample points. Drawing on the ideas of existing clustering algorithms and on a spatial model of multi-label data, we propose two approaches: a multi-label clustering method based on distance measure (MCDM) and a multi-label clustering algorithm based on a random walk model (MCRW). In MCDM, each data point is first assigned to a cluster by a single-label clustering algorithm. Each cluster is then treated as a ball in the space, and data points located where clusters overlap may belong to multiple clusters. In MCRW, the data points are mapped into the space to build a random walk graph, which is traversed according to the walking rules. After iteration we obtain a stable probability distribution, and we then obtain the multiple clusters by applying a threshold to that distribution.

F-measure is a widely used clustering validation measure, but it is not suitable for multi-label clustering algorithms: by its definition, its value may fall outside the range [0,1] when it is applied to multi-label clustering. We revise F-measure to guarantee that its value stays in the range [0,1]. Experiments on benchmark datasets demonstrate that the proposed multi-label clustering methods are effective and that the revision of F-measure is reasonable.
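The ball-overlap idea behind MCDM can be illustrated with a minimal Python sketch. The abstract does not say how the balls are defined, so the sketch makes an assumption: each ball is centred on a given cluster centre, and its radius is the largest distance from that centre to a point hard-assigned to the cluster. The function name `multi_label_by_balls` and the toy data are hypothetical and are not taken from the thesis.

```python
import math

def multi_label_by_balls(points, hard_labels, centers):
    """Assign each point to every cluster whose ball contains it.

    Assumption (not from the thesis): the ball of cluster k is centred at
    centers[k], with radius equal to the largest distance from that centre
    to any point that the single-label step assigned to k.
    """
    # Radius of each ball, derived from the single-label (hard) assignment.
    radii = [0.0] * len(centers)
    for p, k in zip(points, hard_labels):
        radii[k] = max(radii[k], math.dist(p, centers[k]))

    # A point lying inside several balls receives several cluster labels.
    memberships = []
    for p in points:
        inside = {k for k, c in enumerate(centers)
                  if math.dist(p, c) <= radii[k]}
        memberships.append(inside)
    return memberships

# Toy 2-D data: the hard labels stand in for the output of any
# single-label clustering algorithm.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (3.0, 0.0), (5.0, 0.0), (8.0, 0.0)]
hard_labels = [0, 0, 0, 1, 1, 1]
centers = [(0.0, 0.0), (5.0, 0.0)]

memberships = multi_label_by_balls(points, hard_labels, centers)
for p, m in zip(points, memberships):
    print(p, sorted(m))
```

In this toy example the point (2.0, 0.0) lies inside both balls and therefore receives both labels, while the other points keep a single label. Other radius definitions (for example a mean or percentile distance) would change which points fall into the overlap.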
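The range problem with F-measure can be reproduced in a few lines of Python. The sketch assumes the usual class-weighted clustering F-measure, F = Σ_i (n_i / n) · max_j F(i, j), where n_i is the size of class i and n is the number of data points. With overlapping classes the weights n_i / n sum to more than 1, so the score can exceed 1. The normalised variant shown here, which divides by Σ_i n_i instead of n, is only one possible fix and is an assumption of this sketch; the abstract does not state the revision actually used in the thesis.

```python
def pair_f(cls, clu):
    """F-score between one true class and one cluster (both sets of point ids)."""
    overlap = len(cls & clu)
    if overlap == 0:
        return 0.0
    precision = overlap / len(clu)
    recall = overlap / len(cls)
    return 2 * precision * recall / (precision + recall)

def f_measure(classes, clusters, total_weight):
    """Class-weighted F-measure; each class is weighted by len(cls) / total_weight."""
    return sum(
        len(cls) / total_weight * max(pair_f(cls, clu) for clu in clusters)
        for cls in classes
    )

# Four data points, two overlapping classes, and a clustering that
# recovers both classes perfectly.
classes = [{0, 1, 2}, {1, 2, 3}]
clusters = [{0, 1, 2}, {1, 2, 3}]
n_points = 4

# Usual definition: weights n_i / n sum to 6/4, so the score leaves [0, 1].
classic = f_measure(classes, clusters, n_points)

# Hypothetical normalisation (an assumption): weights n_i / sum(n_i) sum to 1.
normalised = f_measure(classes, clusters, sum(len(c) for c in classes))

print(classic, normalised)
```

Because every per-pair F-score lies in [0,1] and the normalised weights sum to 1, the normalised score is bounded by 1 for any input, whereas the classic score is bounded only by Σ_i n_i / n.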
Keywords/Search Tags: distance measure, random walk, multi-label clustering, multi-label clustering evaluation