Research On K-means Clustering Algorithm Based On Semi-Supervised Good Point Set And Leader

Posted on: 2012-06-01  Degree: Master  Type: Thesis
Country: China  Candidate: J Zhang  Full Text: PDF
GTID: 2218330338970514  Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology and the internet, database applications have been growing in scale, scope, and depth, and the capacity to produce and collect data has kept improving. As a result, large amounts of data are accumulating in many fields of real life. How to find the intrinsic relationships within such large-scale data, so that hidden information can be extracted and knowledge resources can be built, has become a hot topic.
Data mining is the process of extracting implicit, valid, novel, potentially valuable, and ultimately understandable patterns or knowledge from large amounts of data, and it is widely applied in many areas of real life. Clustering analysis is one of the three major areas of data mining; it has been studied for several decades and has produced a large body of theories and methods. Among these, K-means clustering, which is based on the partitioning method, is the most classical algorithm.
The K-means clustering algorithm is easy to implement, scalable, and highly efficient on large data sets, and its time complexity is nearly linear. However, the algorithm has inherent shortcomings: 1) it is very sensitive to initial conditions; 2) it often gets trapped in a local minimum; 3) it is suited only to numerical data and captures well only clusters of hyperspherical shape.
This thesis studies and analyzes the typical K-means clustering algorithm in depth and summarizes its strengths, its weaknesses, and the improvements proposed in recent years. Scholars have proposed many improved methods and strategies for the shortcomings of K-means, especially for defects 1) and 2).
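To make defects 1) and 2) concrete, the following is a minimal sketch of the standard K-means iteration (often called Lloyd's algorithm), not the thesis's own code. The four-point data set and both sets of initial centers are assumptions chosen for illustration. The sketch runs the same procedure from two different initializations and compares the resulting sum of squared errors.

```python
import math

def assign(points, centers):
    """Return, for every point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def kmeans(points, init_centers, max_iter=100):
    """Plain K-means: alternate assignment and mean update until stable."""
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of the members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(old)
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    # Sum of squared distances from each point to its assigned center.
    cost = sum(math.dist(p, centers[lab]) ** 2
               for p, lab in zip(points, labels))
    return centers, labels, cost

# Hypothetical data: the corners of a wide rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, _, good_cost = kmeans(points, [(0, 0), (4, 0)])  # splits left / right
_, _, bad_cost = kmeans(points, [(2, 0), (2, 1)])   # splits bottom / top
print(good_cost, bad_cost)
```

With these assumed inputs the first run converges to a cost of 1.0 and the second stays at 16.0. Both are fixed points of the iteration, so the result depends entirely on where the centers start.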
This thesis focuses on the sensitivity of the K-means clustering algorithm to its initial values and, by combining semi-supervised learning, the Leader approach, and Good Point Set theory, presents two new methods for selecting the initial centers. The work is summarized as follows.
A new method for selecting the initial K-means clustering centers, the S_SLK algorithm, is proposed based on semi-supervised learning and the Leader method. It reduces the instability of clustering results caused by randomly selected initial centers: supervised information is used to improve the performance of unsupervised learning, and the Leader approach preserves the distribution characteristics of the data objects.
A new improved K-means clustering algorithm is proposed that adopts Good Point Set theory and the Leader method. Good Point Set theory can produce points that are better than randomly selected ones. In this thesis, the two ways of combining the Good Point Set with the Leader approach give two algorithms, called KLG and KGL.
For the improved KLG and KGL algorithms, we conducted a large number of experiments to show their effectiveness and feasibility, and we compared the experimental results with those of the traditional algorithm and of algorithms from the literature. The experiments and comparisons show that the improved algorithms significantly outperform the traditional algorithm and the other initialization methods. Finally, we conclude that the KGL algorithm is better than the other algorithms listed.
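The abstract does not give the details of S_SLK, KLG, or KGL, so the following is only a generic sketch of the two building blocks it names, not the thesis's algorithms. It assumes the usual single-pass Leader clustering with a distance threshold, and one commonly cited cosine-based good point set construction on the unit cube. The function names, the `threshold` parameter, and the rule of ranking leaders by follower count are all hypothetical choices made for illustration.

```python
import math

def leader_clusters(points, threshold):
    """Single-pass Leader clustering: a point joins the first leader
    within `threshold`, otherwise it becomes a new leader."""
    leaders, counts = [], []
    for p in points:
        for i, ld in enumerate(leaders):
            if math.dist(p, ld) <= threshold:
                counts[i] += 1
                break
        else:
            leaders.append(tuple(p))
            counts.append(1)
    return leaders, counts

def initial_centers(points, k, threshold):
    """Assumed rule: use the k leaders with the most followers."""
    leaders, counts = leader_clusters(points, threshold)
    order = sorted(range(len(leaders)), key=lambda i: -counts[i])
    return [leaders[i] for i in order[:k]]

def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, math.isqrt(n) + 1))

def good_point_set(n, dim):
    """Assumed construction: r_k = 2*cos(2*pi*k/p), with p the smallest
    prime >= 2*dim + 3; point i has coordinates frac(i * r_k)."""
    p = 2 * dim + 3
    while not is_prime(p):
        p += 1
    r = [2 * math.cos(2 * math.pi * k / p) for k in range(1, dim + 1)]
    return [tuple((i * rk) % 1.0 for rk in r) for i in range(1, n + 1)]

# Hypothetical data: two dense groups and one isolated point.
data = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (5, 5.1), (10, 0)]
centers = initial_centers(data, k=2, threshold=1.0)
candidates = good_point_set(10, dim=2)
print(centers)
```

Here the two leaders with the most followers, `(5, 5)` and `(0, 0)`, are returned and the isolated point is ignored. The good point set values lie in the unit square and would need scaling to the data range before use. How the thesis orders or combines the two steps in KLG and KGL is not stated in the abstract.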
Keywords/Search Tags:Data Mining, Clustering Analysis, K-means Clustering Algorithm, Semi-Supervised, Good Point Set, Leader Method