
Research On Subspace Clustering Algorithms For High-dimensional Data

Posted on: 2013-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 2268330392470525
Subject: Information management and information systems
Abstract/Summary:
With the development of information technology and the Internet, high-dimensional data on the Internet, such as multimedia data and gene microarray data, is growing exponentially, and the number of attributes (dimensions) can reach several hundred. Under these circumstances, clustering is one of the most important methods for analyzing high-dimensional data.

The characteristics of high-dimensional data differ greatly from those of low-dimensional data. For instance, the similarity measures commonly used in low-dimensional clustering no longer yield good clustering results in high-dimensional space; some attributes are correlated with one another to some extent; and subspaces may be spanned by different combinations of attributes. These features make high-dimensional data clustering a challenging task. Studying high-dimensional data clustering techniques on the basis of the well-developed theory of data mining is therefore critically important for effectively guiding new directions of Internet development.

This thesis focuses on high-dimensional data clustering techniques. We first summarize the prevalent methods and the current state of high-dimensional data analysis and categorize the existing high-dimensional data clustering techniques, such as dimension reduction, manifold learning, distance metric learning, and subspace clustering. We then concentrate on subspace clustering methods. After studying and improving the bottom-up subspace clustering methods in depth, we propose a novel subspace clustering method based on kernel density estimation; extensive experiments show the superior effectiveness and efficiency of the proposed method. The main contents and contributions are summarized as follows:

1. We introduce the subspace clustering problem for high-dimensional data and then study the bottom-up subspace clustering algorithms in depth. At the end of Chapter 2, the density divergence problem is introduced for further study.
2. We propose a subspace clustering algorithm based on kernel density estimation to effectively address the dilemma of grid partitioning and the density divergence problem. Related techniques are introduced first and the basic terms are defined. The algorithm is then described in detail at the end of Chapter 3.
3. We conduct extensive experiments on both synthetic and real datasets; comparisons of scalability, accuracy, and efficiency with existing subspace clustering algorithms show the superiority of the proposed algorithm.
4. Finally, the conclusion presents our vision for a distributed concurrency framework and for extending the algorithm to handle both numerical and categorical attributes.
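
As a rough illustration of why kernel density estimation can stand in for a fixed grid partition, the sketch below finds dense intervals along a single attribute by thresholding a Gaussian kernel density estimate. It is not the algorithm proposed in the thesis, which the abstract does not spell out. The function names, the bandwidth, the density threshold, and the synthetic data are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_kde_1d(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `values` at each grid point."""
    diffs = (grid[:, None] - values[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(values) * bandwidth)

def dense_intervals(values, bandwidth, threshold, num_grid=512):
    """Return (start, end) intervals where the estimated density exceeds `threshold`.

    Interval boundaries follow the data instead of fixed, equal-width grid cells.
    """
    lo = values.min() - 3 * bandwidth
    hi = values.max() + 3 * bandwidth
    grid = np.linspace(lo, hi, num_grid)
    density = gaussian_kde_1d(values, grid, bandwidth)

    intervals = []
    start = None
    prev = grid[0]
    for x, is_dense in zip(grid, density > threshold):
        if is_dense and start is None:
            start = x                      # entering a dense region
        elif not is_dense and start is not None:
            intervals.append((float(start), float(prev)))  # leaving it
            start = None
        prev = x
    if start is not None:
        intervals.append((float(start), float(grid[-1])))
    return intervals

# Hypothetical one-dimensional attribute with two well-separated groups.
rng = np.random.default_rng(0)
attribute = np.concatenate([
    rng.normal(0.0, 0.5, size=200),
    rng.normal(10.0, 0.5, size=200),
])
intervals = dense_intervals(attribute, bandwidth=0.3, threshold=0.05)
for start, end in intervals:
    print(f"dense interval: [{start:.2f}, {end:.2f}]")
```

In a bottom-up scheme, dense one-dimensional intervals like these would typically be combined across attributes to form candidate higher-dimensional subspace clusters; how the thesis performs that step is not stated in the abstract.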
Keywords/Search Tags:High-dimensional data, Clustering analysis, Subspace clustering, Kernel density estimation