Research And Improvement Of The Clustering Algorithm Based On Sparsity Score Entropy And Density Entropy

Posted on: 2018-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: L S Li
GTID: 2348330536972416
Subject: Computer Science and Technology
Abstract/Summary:
Clustering analysis plays a significant role in data mining as a critical analysis tool and method. The objective of clustering is to group unlabeled data according to some criterion so that the resulting clusters are compact internally and well separated from one another. Clustering techniques are now widely applied in machine learning, pattern recognition, image processing, information retrieval, statistics, and other fields. Research on clustering analysis concentrates mainly on clustering algorithms that group data efficiently and accurately. Depending on the data type, the clustering objective, and the application, clustering algorithms can be divided into four categories: partitional, hierarchical, grid-based, and density-based. K-means is by far the most popular clustering algorithm in scientific and industrial applications, and compared with other clustering methods it also handles large, multi-dimensional datasets well. However, one of its drawbacks is that an optimal value of k must be specified before the algorithm is executed. In most cases, the choice of parameters and variables, whether made by the user or by an algorithm, strongly affects the performance and results of the algorithm, particularly when the data is multi-dimensional, large, and arrives continuously and rapidly. In addition, DBSCAN has high time complexity, and its input parameters must also be entered manually. Moreover, DBSCAN cannot produce satisfactory clustering results on datasets with varied density. In this paper, we propose two novel algorithms: one determines the optimal number of clusters in K-means based on sparsity score entropy; the other identifies noise and border points based on density entropy. To demonstrate the effectiveness of these methods, they are tested on synthetic datasets and UCI datasets. The main contributions of this paper are as follows:

(1) This paper presents a new approach, called Sparsity Score Entropy, for determining the optimal number of clusters in the k-means clustering algorithm. The approach uses information entropy theory to select features that have first been processed with the Sparsity Score. Optimal sampling is conducted first, especially for large-scale datasets. Each feature dimension of a set of sample data points has a sparsity score, which represents its ability to express the data sparsely. The smaller the sparsity score, the more important the feature. Features are selected using an entropy ratio: the entropy obtained after removing one feature divided by the total entropy. With the proposed approach, the OS validity index is improved for finding the optimal value of k. Performance was judged against nine other common cluster validity indexes and two other feature selection methods. Experimental results on four well-known UCI datasets show that our approach is often powerful enough to improve the accuracy of choosing k.

(2) Density-based clustering algorithms play an important role in handling uniformly dense datasets of arbitrary shape and in detecting noise without requiring the number of clusters in advance. However, unevenly distributed, complex real-world data pose a challenge to the existing methods. To process data with non-uniform density automatically, a density-variable automatic clustering algorithm is proposed. Our approach consists of the following steps: in the first stage, rough clustering results, a primary noise set, and a labeled border set are obtained based on the absolute minimum of the border degree; in the second stage, a specific border noise set is extracted from the labeled border set based on the local minimum of the border degree. During this procedure, a novel entropy-ratio method based on the border degree automatically determines the border set and then adaptively performs cluster analysis for each cluster. The effectiveness of this method has been tested on two synthetic datasets in comparison with four clustering algorithms. An experiment on a real-world dataset has also been conducted.
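
The entropy-ratio idea in contribution (1), comparing the entropy after removing one feature with the total entropy, can be illustrated with a small sketch. The abstract does not define the entropy or the sparsity score, so the code below swaps in a generic similarity-based entropy (pairwise similarity S = exp(-alpha * distance)) purely as an assumption. The function names `pairwise_entropy` and `entropy_ratios` and the parameter `alpha` are hypothetical and do not come from the thesis.

```python
import math


def pairwise_entropy(points, alpha=0.5):
    """Similarity-based entropy of a data set.

    ASSUMPTION: this definition is a stand-in, not the thesis's formula.
    Each pair of points contributes -(S*log(S) + (1-S)*log(1-S)),
    where S = exp(-alpha * Euclidean distance).
    """
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            s = math.exp(-alpha * d)
            # Clamp S away from 0 and 1 so both logarithms stay finite.
            s = min(max(s, 1e-12), 1.0 - 1e-12)
            total -= s * math.log(s) + (1.0 - s) * math.log(1.0 - s)
    return total


def entropy_ratios(points, alpha=0.5):
    """For each feature, return (entropy without that feature) / (total entropy)."""
    total = pairwise_entropy(points, alpha)
    n_features = len(points[0])
    ratios = []
    for f in range(n_features):
        # Drop column f from every point and recompute the entropy.
        reduced = [[v for k, v in enumerate(p) if k != f] for p in points]
        ratios.append(pairwise_entropy(reduced, alpha) / total)
    return ratios


if __name__ == "__main__":
    # Toy data: feature 0 varies, feature 1 is constant.
    data = [[0.0, 5.0], [1.0, 5.0], [4.0, 5.0]]
    for index, ratio in enumerate(entropy_ratios(data)):
        print(f"feature {index}: entropy ratio = {ratio:.6f}")
```

In this toy data, removing the constant feature leaves all pairwise distances unchanged, so its ratio stays close to 1, while removing the only varying feature changes the entropy sharply. How the ratios are thresholded or combined with the sparsity score is specified in the full thesis, not in this abstract.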
Keywords/Search Tags: Clustering analysis, Sparsity Score Entropy, Density Entropy, Automatically determining the parameters