
Research On Clustering Algorithm Of High Dimensional Data And Its Distance Metric

Posted on: 2020-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: J J Shao
GTID: 2428330578463923
Subject: Software engineering
Abstract/Summary:
High dimensional data is now everywhere in daily life, and how to extract the information we need from it is an active research topic. There are three main approaches to clustering high dimensional data. The first applies a traditional clustering algorithm after dimension reduction. The second uses a subspace clustering algorithm. The third uses a new distance metric to measure the similarity between sample points. The work of this thesis covers the following two aspects.

First, an appropriate distance metric function has an important effect on clustering results. For large-scale high dimensional datasets, an incremental clustering algorithm is used to analyze the choice of distance metric. The SpFCM algorithm divides a large dataset into small chunks and clusters them chunk by chunk, which makes it easier to obtain good clustering results within limited computer memory. Different distance metric functions are substituted into the traditional SpFCM algorithm to measure the similarity between objects, so that the effect of each metric on SpFCM can be observed. Four distance metrics are used on different high dimensional datasets: the Euclidean metric, the Cosine metric, the Correlation distance metric and the extended Jaccard similarity metric. According to the experiments, the other three metrics improve the clustering results compared with the Euclidean distance metric. The Correlation distance metric gives the best clustering results, while the Cosine distance metric and the extended Jaccard similarity metric sometimes give only average results.

Second, in order to cluster high dimensional data with Gaussian noise, a new distance metric based incremental clustering algorithm called ANFCM(c+p) is proposed. Because the traditional FCM is sensitive to the initialization of the cluster centers, the proposed algorithm integrates the incremental mechanism of SpFCM: several sample points near the cluster centers of the previous data block are added to the next data block for clustering, in order to reduce the sensitivity of FCM to noise. In particular, the proposed algorithm adopts a new improved distance metric together with a modified objective function and modified constraints. With these improvements, the new algorithm can distinguish the degree of influence of known and unknown classes and strengthen the degree of interaction between classes. The experimental results show that the proposed algorithm clusters high dimensional datasets with Gaussian noise well and is robust.
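
To make the four distance metrics compared above concrete, the following is a minimal Python sketch. The abstract does not give the thesis's exact formulas, so the definitions here are assumptions based on the commonly used forms: Cosine and Correlation distance are taken as one minus the respective similarity, and the extended Jaccard metric is taken as one minus the Tanimoto coefficient. The function names are illustrative only and are not taken from the thesis.

```python
import math


def _dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_distance(x, y):
    """Standard Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """1 - cosine similarity. Assumes neither vector is all zeros."""
    return 1.0 - _dot(x, y) / (math.sqrt(_dot(x, x)) * math.sqrt(_dot(y, y)))


def correlation_distance(x, y):
    """1 - Pearson correlation: cosine distance of the mean-centered vectors.

    Assumes neither vector is constant (otherwise the norm is zero).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    x_centered = [a - mean_x for a in x]
    y_centered = [b - mean_y for b in y]
    return cosine_distance(x_centered, y_centered)


def extended_jaccard_distance(x, y):
    """1 - extended Jaccard (Tanimoto) similarity.

    Similarity is x.y / (|x|^2 + |y|^2 - x.y); assumed form, not
    confirmed by the abstract.
    """
    xy = _dot(x, y)
    return 1.0 - xy / (_dot(x, x) + _dot(y, y) - xy)


if __name__ == "__main__":
    # Two toy vectors that point in the same direction but differ in length.
    u = [1.0, 2.0, 3.0]
    v = [2.0, 4.0, 6.0]
    for metric in (euclidean_distance, cosine_distance,
                   correlation_distance, extended_jaccard_distance):
        print(metric.__name__, metric(u, v))
```

In this toy example the two vectors differ only in scale, so the Cosine and Correlation distances are essentially zero while the Euclidean distance is not. This kind of scale insensitivity is one commonly cited reason why such metrics can behave differently from the Euclidean metric on high dimensional data.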
Keywords/Search Tags: high dimensional data, fuzzy clustering algorithm, distance metric, similarity study, Gaussian noise