Research Of Multi-label Clustering Algorithms And Their Evaluation

Posted on:2014-01-27

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Cheng

Full Text:PDF

GTID:2268330401982050

Subject:Computer software and theory

Abstract/Summary:

Clustering is the process of grouping similar data into clusters, and it is one of the key technologies in data mining. Considerable research effort has focused on single-label clustering, but little of it has addressed multi-label problems. In the real world, however, a data point may belong to multiple clusters. For instance, a document may belong to both politics and economics, and a piece of music may be both happy and relaxing. Traditional clustering methods do not take this into account. Multi-label clustering differs significantly from traditional single-label clustering: the correlation and co-occurrence between categories mean that traditional single-label methods cannot be applied to multi-label problems directly. Existing multi-label clustering algorithms are few, and they target only text mining. They rely heavily on prior information, a large number of parameters, and threshold settings, so they lack generality.

Based on the above analysis, we focus on the correlation and co-occurrence between clusters and construct a region model of the sample points. Drawing on the ideas of existing clustering algorithms and the spatial model of multi-label data, we propose two approaches: a multi-label clustering method based on a distance measure (MCDM) and a multi-label clustering algorithm based on a random walk model (MCRW). In MCDM, each data point is first assigned to a cluster by a single-label clustering algorithm. Each cluster is then treated as a ball in the space, and data points located where balls overlap may belong to multiple clusters. In MCRW, the data points are mapped into the space to build a random walk graph. The graph is traversed according to the walking rules, and after iteration we obtain a stable probability distribution. Finally, we obtain the multiple clusters by applying a threshold to the probability distribution.

The F-measure is a widely used clustering validation measure, but it is not suitable for multi-label clustering algorithms. By its definition, its value may fall outside the range [0,1] when it is used for multi-label clustering. We revise the F-measure to guarantee that its value stays within [0,1]. Experiments on benchmark datasets demonstrate that the proposed multi-label clustering methods are effective and that the revision of the F-measure is reasonable.
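
The ball idea behind MCDM can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how the ball radius is chosen, so the code assumes the radius is the largest centre-to-member distance, and the function names (centroid, balls_from_partition, multi_label_assign) are hypothetical. The single-label partition is taken as given, as if produced by any single-label clustering algorithm.

```python
import math

def centroid(points):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def balls_from_partition(points, labels):
    """Turn a single-label partition into one ball (centre, radius) per cluster.

    Assumption (not from the abstract): the radius is the largest
    distance from the centre to any member of the cluster.
    """
    members = {}
    for point, label in zip(points, labels):
        members.setdefault(label, []).append(point)
    balls = {}
    for label, pts in members.items():
        centre = centroid(pts)
        radius = max(math.dist(centre, p) for p in pts)
        balls[label] = (centre, radius)
    return balls

def multi_label_assign(points, balls):
    """Give each point the label of every ball that contains it."""
    return [
        sorted(label for label, (centre, radius) in balls.items()
               if math.dist(centre, p) <= radius)
        for p in points
    ]

# Toy data: cluster 0 is widely spread, cluster 1 is tight.
points = [(0, 0), (0, 4), (0, -4), (3, 0), (5, 0)]
labels = [0, 0, 0, 1, 1]          # assumed output of a single-label step
balls = balls_from_partition(points, labels)
memberships = multi_label_assign(points, balls)
print(memberships)  # [[0], [0], [0], [0, 1], [1]]
```

In this toy example the point (3, 0) lies inside both balls, so it receives both labels, while every other point keeps its single label.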
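
The random-walk step of MCRW can be sketched in a similar spirit. The abstract does not specify the graph construction or the walking rules, so everything concrete here is an assumption: the walk is a random walk with restart from one hypothetical seed node per cluster, the graph is a small unweighted adjacency matrix, and the stable distributions are normalised across clusters before thresholding. The names random_walk_with_restart and threshold_memberships are illustrative only.

```python
def random_walk_with_restart(adj, seed, restart=0.3, iters=200):
    """Iterate a walk that restarts at `seed` with probability `restart`.

    adj is a square matrix of non-negative edge weights. The result
    approximates the stable distribution of the walk over the nodes.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    p = [0.0] * n
    p[seed] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                if adj[i][j]:
                    # Move from i to j in proportion to the edge weight.
                    nxt[j] += (1 - restart) * p[i] * adj[i][j] / degree[i]
        nxt[seed] += restart
        p = nxt
    return p

def threshold_memberships(adj, seeds, threshold=0.3):
    """Assign each node to every cluster whose normalised score passes the threshold."""
    walks = {name: random_walk_with_restart(adj, s) for name, s in seeds.items()}
    result = []
    for i in range(len(adj)):
        total = sum(w[i] for w in walks.values())
        result.append(sorted(
            name for name, w in walks.items()
            if total > 0 and w[i] / total >= threshold
        ))
    return result

# Toy graph: a path 0 - 1 - 2 - 3 - 4 with one seed at each end.
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
memberships = threshold_memberships(adj, {"a": 0, "b": 4})
print(memberships)  # [['a'], ['a'], ['a', 'b'], ['b'], ['b']]
```

The middle node is equally reachable from both seeds, so it passes the threshold for both clusters. The threshold value itself is a free parameter in this sketch.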
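
The range problem with the F-measure can be reproduced with a short example. The sketch assumes the commonly used class-weighted clustering F-measure, in which each class contributes its best per-cluster F-score weighted by class size divided by the number of points; the abstract does not state which definition the thesis uses. The normalise_by_memberships option shows one possible correction (dividing by the total number of class memberships instead) and is not necessarily the revision proposed in the thesis.

```python
def f_measure(classes, clusters, n_points, normalise_by_memberships=False):
    """Class-weighted clustering F-measure.

    classes and clusters map a name to a set of point ids; with
    multi-label data the sets may overlap.
    """
    weighted_sum = 0.0
    for members in classes.values():
        best = 0.0
        for found in clusters.values():
            common = len(members & found)
            if common:
                precision = common / len(found)
                recall = common / len(members)
                best = max(best, 2 * precision * recall / (precision + recall))
        weighted_sum += len(members) * best
    if normalise_by_memberships:
        # Illustrative fix: weights now sum to 1 even when classes overlap.
        denominator = sum(len(m) for m in classes.values())
    else:
        denominator = n_points
    return weighted_sum / denominator

# Four points, two overlapping classes, and a perfect clustering.
classes = {"A": {0, 1, 2}, "B": {1, 2, 3}}
clusters = {"c1": {0, 1, 2}, "c2": {1, 2, 3}}
naive = f_measure(classes, clusters, n_points=4)
fixed = f_measure(classes, clusters, n_points=4, normalise_by_memberships=True)
print(naive, fixed)  # 1.5 1.0
```

Because points 1 and 2 are counted in both classes, the class weights sum to 6/4 rather than 1, which is how a perfect clustering ends up with a score above 1 under the naive definition.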

Keywords/Search Tags:

distance measure, random walk, multi-label clustering, multi-label clustering evaluation

Related items

1	Research And Implementation Of Multi-label Text Classification Based On User Generated Content
2	Random Walk Learning On Graph
3	Multi-label Classification Algorithm Based On Random Forest And Predictive Clustering Tree
4	Research On Several Issues Of Multi-Label Feature Representation
5	Research On Multi-label Classification Algorithms Based On Samples And Property Analysis
6	Research And Implementation Of A Network Embedding Method Based On Multi-hop Random Walk
7	Joint Classification With Heterogeneous Labels Using Random Walk With Dynamic Label Propagation
8	Multi-label Learning Algorithm And Its Application In Product Evaluation And Scoring
9	Research And Application Of Random Walk Algorithm Based On Distance
10	Research On Multi-Label Learning Algorithms With Distance Metric Learning