Improved K-means Clustering Method And Its Application

Posted on:2015-08-24

Degree:Master

Type:Thesis

Country:China

Candidate:H R Li

Full Text:PDF

GTID:2298330431473504

Subject:Management Science and Engineering

Abstract/Summary:

With the popularization of computers, huge amounts of data are accumulated every day, and traditional database management systems can no longer meet practical needs. Data mining was introduced to reveal implicit, previously unknown information. Cluster analysis is an important field of data mining: it not only reveals the differences between datasets but also provides an important basis for further discovery. Existing clustering algorithms fall into five categories: partition-based, hierarchical, density-based, grid-based and model-based algorithms. Among the partition-based algorithms, K-means clustering can find and recognize patterns and trends in large datasets without prior information such as class labels. It divides all samples into clusters so as to minimize the within-cluster distance and maximize the distance between different clusters. However, it is only suitable for spherical or near-spherical data, a limitation that stems from its similarity measure, the Euclidean distance, which gives every feature the same weight and contribution to the classification. The similarity measure is therefore a vital factor in final clustering performance: optimal clustering results cannot be obtained without an appropriate similarity measure. Accordingly, we propose two novel K-means clustering methods that adopt other similarity measures suited to nonnegative, ellipsoidal or spheroid-like data, which are more common in real applications.

(1) A novel K-means clustering method based on the I-divergence distance measure. I-divergence was originally introduced in statistics to measure the difference between a measured value and the true value. It has been widely used for solving positive linear inverse problems, and it has the advantages of regularity (consistency, distinctness and continuity) and locality.
Experimental results on simulated data and UCI data show that the proposed K-means clustering method is suitable for ellipsoidal or spheroid-like data.

(2) A novel K-means clustering method based on Max Entropy. The Max Entropy principle was first expounded to guarantee the uniqueness and consistency of probability assignments. Its significant characteristic is that different features can be merged into one probability model without any independence requirement. It has been widely used for solving positive linear inverse problems, and it is also used to measure the discrepancy between observed and generated values. Experimental results on simulated data and UCI data show that the proposed K-means clustering method is suitable for ellipsoidal or spheroid-like data and has better clustering performance.

(3) The improved K-means algorithm is applied to a practical problem: positioning the agricultural development of major cities in Northeast China. We select specific indicators to categorize 36 major cities and analyze their agricultural development. Of the three resulting clusters, the first includes 4 cities of Heilongjiang Province, the second includes 26 cities, and the third includes 6 cities.
The results indicate that Heilongjiang Province leads in agricultural development, with main crop yields and crop area superior to those of the other provinces; Liaoning Province has certain advantages in main crop yield and area and falls in the second cluster among the three northeastern provinces; and Jilin Province falls in the third cluster among the three northeastern provinces.

Finally, we hope that by improving the traditional K-means clustering algorithm, this work contributes to theoretical research on clustering algorithms and broadens their range of practical applications, including agricultural production, machine learning, pattern recognition, business decision-making and other fields.
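
As a minimal sketch of the idea in item (1), the Python code below swaps the Euclidean distance of standard K-means for the generalized I-divergence D(x || c) = sum_j [x_j log(x_j / c_j) - x_j + c_j]. The abstract does not give the thesis's exact formulation, so this form of the divergence, the mean-based centroid update (valid here because this divergence is a Bregman divergence), the random initialization, and the names `i_divergence` and `kmeans_i_divergence` are all assumptions made for illustration.

```python
import numpy as np


def i_divergence(X, C, eps=1e-12):
    """Generalized I-divergence D(x || c) from every row of X to every row of C.

    Assumed form: sum_j [x_j * log(x_j / c_j) - x_j + c_j], for nonnegative data.
    Returns an array of shape (n_samples, n_centers).
    """
    X = np.asarray(X, dtype=float)[:, None, :] + eps  # shape (n, 1, d)
    C = np.asarray(C, dtype=float)[None, :, :] + eps  # shape (1, k, d)
    return np.sum(X * np.log(X / C) - X + C, axis=2)


def kmeans_i_divergence(X, k, n_iter=100, seed=0):
    """K-means-style clustering with I-divergence instead of Euclidean distance.

    Hypothetical helper: centers start as k random samples, each sample is
    assigned to the center with the smallest divergence, and each center is
    then updated to the arithmetic mean of its assigned samples.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        new_labels = np.argmin(i_divergence(X, centers), axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


if __name__ == "__main__":
    # Two small groups of strictly positive points, made up for this example.
    data = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                     [100.0, 100.0], [101.0, 99.0], [99.0, 102.0]])
    labels, centers = kmeans_i_divergence(data, k=2)
    print(labels)
    print(centers)
```

The input is assumed to be nonnegative, matching the kind of data the abstract targets; negative values would make the logarithm undefined.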

Keywords/Search Tags:

Cluster analysis, K-means, Similarity measure, I-divergence, Max Entropy

Related items

1	Similarity Measures In Cluster Analysis And Its Applications
2	Similarity Measure And Its Application In Image Non-local Filtering
3	Multi-granulation Rough Sets And Granular Reductions Based On Similarity Measure
4	Studies On Semi-supervised Clustering Algorithms Based On Entropy And Divergence
5	Cluster Analysis And Its Application On Image Processing
6	The Analysis Of K-means Cluster Algorithm For Website Content
7	Studies On Clustering Algorithms For Categorical Data
8	Research On Optimization Methods For Kernel K-means
9	Research On Approximate XML Joins
10	Integration Of Research And Application Of The Algorithm Based On The Improvement Of The K-means Clustering