In many machine learning applications, labeled data is costly to obtain, and in some cases it is very difficult to obtain labels for every class. In recent years, semi-supervised learning has become a research focus in machine learning. Semi-supervised learning uses both labeled and unlabeled samples to guide the learning process, leading to better learning performance. Research on semi-supervised learning can be divided into two categories: semi-supervised classification and semi-supervised clustering. Semi-supervised clustering uses a small number of labeled samples together with unlabeled samples to guide the clustering process. We study techniques related to clustering and semi-supervised learning, and introduce text data preprocessing, distance metrics, the evaluation of clustering algorithms, and the constraint-based k-means clustering algorithm. The supervised information consists of labeled samples selected at random from the collection. These labels are converted into a Must-link constraint set and a Cannot-link constraint set, which are used to reconstruct the similarity matrix of the collection and thereby re-establish the criteria for similarity and dissimilarity among samples. The k-means++ algorithm provides an effective method for seeding a clustering; it reduces the sensitivity to the initial seeds, and its clustering accuracy is better than that of traditional random seeding. This paper incorporates the influence of the labels into the careful seeding process of the k-means++ algorithm and proposes a novel k-means algorithm based on labeled samples and similarity adjustment (LSKM). Experiments on the 20-newsgroup corpus and the Spam email collection show that LSKM consistently outperforms Seeded k-means and k-means++ in both accuracy and efficiency.
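The two building blocks named above, turning labels into Must-link and Cannot-link constraints that adjust a similarity matrix, and k-means++ seeding, can be illustrated with a minimal Python sketch. This is not the LSKM algorithm itself: the abstract does not state how similarities are adjusted or how labels enter the seeding step. The function names, the use of cosine similarity, the fixed values `high=1.0` and `low=0.0`, and the dictionary format for labels are all assumptions made for illustration, and the seeding shown is plain k-means++ (distance-squared sampling) without any label influence.

```python
import itertools
import numpy as np


def constraints_from_labels(labels):
    """Derive Must-link and Cannot-link index pairs from partial labels.

    labels: dict mapping sample index -> class label (hypothetical format).
    """
    must_link, cannot_link = [], []
    for i, j in itertools.combinations(sorted(labels), 2):
        if labels[i] == labels[j]:
            must_link.append((i, j))    # same label: should share a cluster
        else:
            cannot_link.append((i, j))  # different labels: should be separated
    return must_link, cannot_link


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms
    return unit @ unit.T


def adjust_similarity(sim, must_link, cannot_link, high=1.0, low=0.0):
    """Return a copy of sim with constrained pairs overwritten.

    Assumption: Must-link pairs get similarity `high`, Cannot-link pairs `low`.
    """
    adjusted = np.array(sim, dtype=float, copy=True)
    for i, j in must_link:
        adjusted[i, j] = adjusted[j, i] = high
    for i, j in cannot_link:
        adjusted[i, j] = adjusted[j, i] = low
    return adjusted


def kmeans_pp_seeds(X, k, rng):
    """Plain k-means++ seeding: sample each new seed with probability
    proportional to its squared distance from the nearest chosen seed."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first seed is chosen uniformly at random
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

For example, with labels `{0: "a", 2: "a", 1: "b"}`, samples 0 and 2 form a Must-link pair, while the pairs (0, 1) and (1, 2) are Cannot-link; every other entry of the similarity matrix is left unchanged.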