A Novel Labels And Similarity Reconstruction Based On K-means Algorithm Application On Text Clustering

Posted on:2012-10-26

Degree:Master

Type:Thesis

Country:China

Candidate:Q C Liu

GTID:2218330368996013

Subject:Computer software and theory

Abstract/Summary:

In many machine learning applications, labeled data is costly to obtain, and in some cases it is very difficult to obtain labels for every class. In recent years, semi-supervised learning has become a research focus in machine learning. Semi-supervised learning uses both labeled and unlabeled samples to guide the learning process, leading to better learning performance. Research on semi-supervised learning can be divided into two categories: semi-supervised classification and semi-supervised clustering. Semi-supervised clustering uses a small number of labeled samples together with unlabeled samples to guide the clustering process.

We studied clustering and semi-supervised learning techniques, and we introduce text data preprocessing, distance metrics, the evaluation of clustering algorithms, and constraint-based k-means clustering. The supervised information consists of labeled samples selected randomly from the collection. These labels are converted into a Must-link constraint set and a Cannot-link constraint set, which are used to reconstruct the similarity matrix of the collection and so re-establish which samples count as similar or dissimilar. The k-means++ algorithm provides an effective seeding method for clustering: it reduces sensitivity to the initial seeds, and its clustering accuracy is better than that of traditional random seeding. This thesis incorporates the influence of the labels into the careful seeding process of k-means++ and proposes a novel k-means algorithm based on labeled samples and adjusted similarity (LSKM). Experiments on the 20-newsgroup corpus and the Spam email collection show that LSKM consistently outperforms Seeded k-means and k-means++ in both accuracy and efficiency.
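The abstract does not give the exact LSKM update rules, so the following is only a minimal sketch of the two ideas it describes: turning labeled samples into Must-link and Cannot-link pairs that overwrite entries of a similarity matrix, and letting the labels influence a k-means++ style seeding. Everything specific here is an assumption made for illustration and is not taken from the thesis: the function names, the use of cosine similarity, the choice of 1.0 and 0.0 as the constrained similarity values, and the use of labeled class centroids as the first seeds.

```python
import numpy as np


def constraints_from_labels(labels):
    """Derive pairwise constraints from a dict {sample_index: class_label}.

    Same label -> Must-link pair; different labels -> Cannot-link pair.
    """
    idx = sorted(labels)
    must, cannot = [], []
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            (must if labels[i] == labels[j] else cannot).append((i, j))
    return must, cannot


def reconstruct_similarity(X, must, cannot):
    """Cosine similarity matrix with constrained pairs overwritten.

    Assumption: Must-link pairs are forced to 1.0 and Cannot-link pairs
    to 0.0; the thesis may use a different adjustment.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    Xn = X / norms
    S = Xn @ Xn.T
    for i, j in must:
        S[i, j] = S[j, i] = 1.0
    for i, j in cannot:
        S[i, j] = S[j, i] = 0.0
    return S


def seed_with_labels(X, labels, k, rng):
    """Label-aware seeding (hypothetical variant, not the thesis's rule).

    The centroid of each labeled class becomes a seed, as in Seeded
    k-means; any remaining seeds are drawn with k-means++ D^2 sampling.
    """
    classes = sorted(set(labels.values()))[:k]
    seeds = [X[[i for i, c in labels.items() if c == cls]].mean(axis=0)
             for cls in classes]
    while len(seeds) < k:
        centers = np.array(seeds)
        # squared distance from every sample to its nearest current seed
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        probs = d2 / total if total > 0 else np.full(len(X), 1.0 / len(X))
        seeds.append(X[rng.choice(len(X), p=probs)])
    return np.array(seeds)


if __name__ == "__main__":
    # Toy document vectors (rows) and three labeled samples.
    X = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 1.0]])
    labels = {0: "a", 1: "a", 2: "b"}
    must, cannot = constraints_from_labels(labels)
    S = reconstruct_similarity(X, must, cannot)
    seeds = seed_with_labels(X, labels, k=3, rng=np.random.default_rng(0))
    print(must, cannot)
    print(S.round(3))
    print(seeds)
```

In this sketch the adjusted matrix `S` and the seeds would then be handed to an ordinary k-means loop; how LSKM combines them during assignment is not stated in the abstract and is left open here.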

Keywords/Search Tags:

Semi-supervised Learning, Semi-supervised Clustering, Text Clustering, k-means algorithm

Related items

1	A Novel Labels And Similarity Reconstruction Based On K-means Algorithm Application On Text Clustering
2	Semi Supervised Clustering Algorithm And Its Application And Research
3	Semi-supervised Learning On Text Data
4	Distributed Clustering And Evolutionary Clustering Algorithm Based On Semi-supervised Learning
5	Research On Risk Degree-Based Safe Semi-Supervised Fuzzy Clustering Algorithm
6	Research And Application Of Semi-supervised Clustering Algorithms
7	Research On Semi-supervised Classification Algorithm Based On Clustering Ensemble
8	Research On Text Clustering Based On Semi-supervised Learning
9	Research On Semi-supervised Learning And Its Application
10	Research And Application Of Active Semi-supervised K-means Clustering Algorithm