Font Size: a A A

Research On Clustering Algorithem For High Dimensional Data

Posted on:2013-02-03Degree:MasterType:Thesis
Country:ChinaCandidate:L H YiFull Text:PDF
GTID:2218330362962979Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Data mining is a knowledge discovery technique that excavates useful knowledgewhich can be used in decision support, data analysis and other areas from a large data.The research in clustering methods for high dimensional data is a hot and difficultproblem in data mining area.The Euclidean distance function is frequently used tomeasure the similarity between data objects in low dimensional space; the traditionalclustering method is based on the similarity between data objects. Because of the inherentsparsity of high dimensional data, as the dimension of the data object increasing, thedistances based on the Lkmetric between different objects are equal and the comparisonbetween different objects no longer exist, the effectiveness of those similaritymeasurement function will degrade rapidly in high dimensional space. In addition,clusters usually only exist in low-dimensional subspace, evermore, the subspaces may beconstructed by different combinations of dimensions. Because of the curse of highdimensionality, many traditional cluster algorithms which work well in low dimensionalspace will yield poor performances when dealing with high dimensional data. Accordingto the exsiting problem in high dimensional data clustering, the main work in the paper isgiven as below:Firstly, this paper introduces many kinds of traditional similarity measurementfunctions, by analying the characteristics of high dimensional data and the disadvantagesof those traditional measurement functions, this paper proposes a similarity measurementfunction Psim based on secondary equi-depth range of dimensions. This function adoptsthe idea of treating dimensions unequally and divides every dimension into severalmutually disjoint intervals, by analying wheather this two different datas falls into thesame interval on the dimension to decide compute the similarity for this two data on thisdimension or not.Secondly, by analyzing the existing subsapace alglorithms, this paper proposes anSRE-CLIQUE algorithm based on relative entropy and subspace re-refinement.According to the distribution characterisitic of the cluster in high dimensional space, the new algorithm adopts the technique of adaptive grids according to the distribution of thedatasets to reduce the number of the grid in the full data space, using the relative entropyas the standards that distinguish the dense area and the sparse area. The algorithm usesthe punning refinement strategy that is based on the correlation of dimensions to judgewheather the subspace contains a cluster or not.Finally, the experiments had been programmed with language in Matlab and theresult of experiments on synthesis datasets show that Psim can avoid the reducedresolution problem of traditional similiraty measurement function.The new algorithm hadbeen programmed in Matlab, experiments on synthesis datasets show that the newalgorithm achieves a gain of runtime and quality to find subspace clusters.
Keywords/Search Tags:Data mining, Clustering analysis, High dimensional data, Similiratymeasurement, Subspace Clustering
PDF Full Text Request
Related items