Clustering High-Dimensional Data Using PCA-Hubness

Posted on:2018-07-24

Degree:Master

Type:Thesis

Country:China

Candidate:J T Lang

Full Text:PDF

GTID:2348330533961358

Subject:Computer Science and Technology

Abstract/Summary:

Cluster analysis partitions a set of objects into clusters, or subsets, so that similar objects are grouped together. Traditional clustering algorithms perform well in low-dimensional data spaces but poorly in high-dimensional ones, mainly because of the curse of dimensionality. One effect of the curse of dimensionality is distance concentration; Hinneburg, Aggarwal, et al. have studied distance concentration and meaningless nearest neighbors in high-dimensional data in depth. Another effect is the hubness phenomenon, and this thesis analyzes the underlying causes from this newer direction.

The concept of hubness was originally proposed by Milos Radovanovic et al. in 2010. It describes a phenomenon in which some objects tend to appear frequently in the nearest-neighbor lists of other objects. Milos Radovanovic et al. proposed four hub clustering algorithms that exploit this property. Although hub clustering algorithms can cluster data in high-dimensional space, they cannot eliminate redundant and noisy data in that space, so they cannot obtain better cluster structures or faster clustering convergence.

To address this problem, this thesis proposes the PCA-Hub clustering algorithm, which is based on the skewness of the inverse neighborhood. The algorithm handles redundant and noisy data in high-dimensional space and obtains better cluster structures and a faster clustering convergence rate. Experimental results show that PCA-Hub improves the silhouette index by 15% on average compared with previous clustering algorithms. When the dimensionality of the data set or the skewness of the inverse neighborhood is high, the performance of PCA-Hub does not depend strongly on the number k of nearest neighbors. The results also show that PCA-Hub produces consistent results when the experimental environment and experimental parameters are kept consistent. PCA-Hub can largely resolve redundant and noisy features in high-dimensional data space.

However, as the number of samples and the dimensionality of the data set increase, the time cost of PCA-Hub grows and can even become unacceptable. This thesis therefore proposes a Quick PCA-Hub clustering algorithm to accelerate PCA-Hub. Experimental results show that Quick PCA-Hub improves the silhouette index by 8% on average compared with previous clustering algorithms. When searching for the ideal k principal components in high-dimensional data space, Quick PCA-Hub dramatically outperforms PCA-Hub.

In summary, the PCA-Hub algorithm proposed in this thesis solves the problem that hub clustering algorithms cannot deal with redundant and noisy features in high-dimensional data space, and various experiments demonstrate its effectiveness. Quick PCA-Hub addresses the high time complexity of searching for the k principal components in PCA-Hub by searching for them quickly. Experimental results show that the algorithm performs well in high-dimensional data space.
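The quantity the abstract calls the skewness of the inverse neighborhood can be illustrated with a short sketch. It counts how often each point appears in the k-nearest-neighbor lists of other points (the k-occurrence count), measures the skewness of those counts, and repeats the measurement after a plain PCA projection. This is not the thesis's PCA-Hub algorithm, whose details the abstract does not give; the function names `k_occurrence`, `skewness`, and `pca_project`, the synthetic Gaussian data, and the choices k = 10 and 5 components are all assumptions made for illustration.

```python
import numpy as np


def k_occurrence(X, k):
    """Count how often each point appears in other points' k-NN lists."""
    sq = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances between all pairs of rows.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors
    return np.bincount(nn.ravel(), minlength=X.shape[0])


def skewness(values):
    """Standardized third moment; large positive values indicate hubs."""
    v = np.asarray(values, dtype=float)
    sigma = v.std()
    if sigma == 0.0:
        return 0.0
    return float(np.mean((v - v.mean()) ** 3) / sigma ** 3)


def pca_project(X, n_components):
    """Project centered data onto its leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


# Hypothetical data: 300 points drawn from a 100-dimensional Gaussian.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 100))

counts = k_occurrence(X, k=10)
Z = pca_project(X, n_components=5)
counts_reduced = k_occurrence(Z, k=10)

print("skewness in 100 dimensions:", skewness(counts))
print("skewness after PCA to 5 dimensions:", skewness(counts_reduced))
```

Every point contributes exactly k entries to the neighbor lists, so the counts always sum to n times k; what changes with dimensionality is how unevenly those entries are distributed across points.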

Keywords/Search Tags:

hub clustering, high-dimensional data, skewness, intrinsic dimension, principal component analysis

Related items

1	Research On Dimension Reduction Methods Of High Dimensional Data
2	The Application Of Clustering Analysis Based On Principal Component Analysis And Rough Set In Financial Index Data
3	Estimating the Intrinsic Dimension of High-Dimensional Data Sets: A Multiscale, Geometric Approach
4	Clustering Algorithm Research Based On The Bilinear Probabilistic Principal Component Analysis
5	Research On Clustering Algorithms For High Dimensional Nonlinear Data
6	Research On High-Dimensional Index In Large-Scale Image Databases
7	Application Of Principal Component Analysis And Clustering In Science And Technology Data Analysis
8	Research And Application Of Density Clustering Algorithm Based On Kernel Principal Component And High Dimensional Distance
9	Recognition Method Of Metal Fracture Images Based On Empirical Ridgelet And Principal Component Analysis
10	Research On Noise And High Dimensional Problems In Clustering