A Novel Labels And Similarity Reconstruction Based On K-means Algorithm Application On Text Clustering

Posted on: 2012-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: Q C Liu
Full Text: PDF
GTID: 2218330368996013
Subject: Computer software and theory
Abstract/Summary:
In many machine learning applications, obtaining labeled data is costly, and in some cases it is very difficult to obtain labels for every class. In recent years, semi-supervised learning has become a research focus in machine learning. It uses both labeled and unlabeled samples to guide the learning process, which leads to better learning performance. Research on semi-supervised learning falls into two categories: semi-supervised classification and semi-supervised clustering. Semi-supervised clustering uses a small number of labeled samples together with unlabeled samples to guide the clustering process. This thesis studies techniques related to clustering and semi-supervised learning, and introduces text data preprocessing, distance metrics, the evaluation of clustering algorithms, and constraint-based k-means clustering. The supervised information consists of labeled samples selected at random from the collection. These labels are converted into a Must-link constraint set and a Cannot-link constraint set, which are used to reconstruct the similarity matrix of the collection and thereby re-establish which samples count as similar or dissimilar. The k-means++ algorithm provides an effective seeding method for clustering: it reduces sensitivity to the initial seeds and achieves better clustering accuracy than traditional random seeding. This thesis incorporates the influence of the labels into the careful seeding process of k-means++ and proposes a novel k-means algorithm based on labeled samples and similarity adjustment (LSKM). Experiments on the 20-newsgroup corpus and the Spam email collection show that LSKM consistently outperforms Seeded k-means and k-means++ in both accuracy and efficiency.
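The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the similarity matrix is adjusted or how labels enter the seeding step. The sketch therefore assumes cosine similarity, assumes must-link pairs are set to similarity 1 and cannot-link pairs to 0, and shows only the standard k-means++ seeding (D² sampling) without the label influence that LSKM adds. The function names `constraints_from_labels`, `reconstruct_similarity`, and `kmeanspp_seeds` are hypothetical.

```python
import numpy as np


def constraints_from_labels(labels):
    """Turn {sample_index: class_label} into must-link and cannot-link pairs."""
    idx = sorted(labels)
    must, cannot = [], []
    for pos, i in enumerate(idx):
        for j in idx[pos + 1:]:
            # Same label -> must-link; different label -> cannot-link.
            (must if labels[i] == labels[j] else cannot).append((i, j))
    return must, cannot


def reconstruct_similarity(X, must, cannot):
    """Cosine similarity matrix with constrained pairs overwritten.

    Assumption (not from the source): must-link pairs get similarity 1.0
    and cannot-link pairs get similarity 0.0.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)  # guard against zero vectors
    S = Xn @ Xn.T
    for i, j in must:
        S[i, j] = S[j, i] = 1.0
    for i, j in cannot:
        S[i, j] = S[j, i] = 0.0
    return S


def kmeanspp_seeds(X, k, rng):
    """Standard k-means++ seeding: pick each new seed with probability
    proportional to its squared distance to the nearest chosen seed."""
    n = X.shape[0]
    seeds = [int(rng.integers(n))]
    d2 = np.sum((X - X[seeds[0]]) ** 2, axis=1)
    while len(seeds) < k:
        total = d2.sum()
        # Fall back to uniform sampling if all points coincide with a seed.
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        nxt = int(rng.choice(n, p=probs))
        seeds.append(nxt)
        d2 = np.minimum(d2, np.sum((X - X[nxt]) ** 2, axis=1))
    return seeds


if __name__ == "__main__":
    # Toy term-weight vectors; samples 0, 1 and 2 are the labeled subset.
    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
    must, cannot = constraints_from_labels({0: "a", 1: "a", 2: "b"})
    print(reconstruct_similarity(X, must, cannot))
    print(kmeanspp_seeds(X, 2, np.random.default_rng(0)))
```

Overwriting constrained entries with fixed values is the simplest way to make labeled pairs count as similar or dissimilar. A softer adjustment, such as scaling the original similarity, would be an equally plausible reading of the abstract.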
Keywords/Search Tags:Semi-supervised Learning, Semi-supervised Clustering, Text Clustering, k-means algorithm