Research On Clustering Methods For High Dimensional Data And Their Applications

Posted on: 2009-09-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L F Chen
GTID: 1118360272488886
Subject: Artificial Intelligence
Abstract/Summary:
Cluster analysis is an important research topic in data mining and has been widely used in many fields, such as message filtering, document categorization, and bioinformatics. In those fields, the data are often high dimensional. For example, document data and gene microarray data commonly have several hundred or even a thousand attributes (or dimensions). The prevalence of such data makes research on high dimensional data clustering increasingly important.

The characteristics of data objects in high dimensional space are quite different from those in low dimensional space. In many cases, the effectiveness of similarity measures commonly adopted in low-dimensional data clustering, such as the L_p-norm, degrades rapidly in high dimensional space because of the inherent sparsity of the data. In addition, clusters usually exist only in low-dimensional subspaces, and these subspaces may be spanned by different combinations of dimensions. Because of the curse of dimensionality, many methods that work well on low-dimensional data perform poorly when clustering high dimensional data.

To address these problems, this thesis proposes several new methods, focusing on new subspace clustering algorithms and high dimensional cluster validation, both based on subspace cluster modeling. The proposed methods are also applied to text categorization, network intrusion detection, and malware detection. The research in this dissertation has both theoretical and practical significance.

The main contributions can be summarized as follows:

1. A probability model for describing subspace clusters in high dimensional space is presented, together with its learning algorithm and clustering objective function.
2. Several recent soft subspace clustering algorithms are improved in terms of stability and clustering accuracy by analyzing their relationships with the probability model. The algorithms are made more robust by embedding local outlier detection.
3. A new definition of fuzzy membership is derived from the probability model, and a fuzzy algorithm for subspace clustering of high dimensional data is proposed. Furthermore, three traditional cluster validity indices are adapted to meet the requirements of subspace clustering. Combined with the fuzzy algorithm, the new subspace cluster validity indices are used to estimate the number of subspace clusters in high dimensional data.
4. A hierarchical method is presented to estimate the number of clusters in large, high dimensional datasets. It avoids the inefficiency that arises in the traditional approach from repeatedly clustering large datasets.
5. A new classification algorithm with linear time complexity is presented for text categorization, combining unsupervised subspace clustering methods with supervised classification methods. The proposed methods are also applied to supervised feature selection in network intrusion detection and to a practical project on aided malware detection.
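
The degradation of L_p-norm similarity in high dimensional space, which the abstract cites as a motivation, can be illustrated with a small experiment. The sketch below is not taken from the thesis: it assumes uniformly random points in the unit hypercube (not the document or microarray data the thesis studies), and the function name relative_contrast and its parameters are hypothetical. It measures how far the farthest point is from a query relative to the nearest one; as the dimension grows, this ratio typically shrinks, so nearest and farthest neighbors become hard to tell apart.

```python
import random


def relative_contrast(dim, n_points=200, p=2.0, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption (illustrative only): points are drawn uniformly from
    the unit hypercube [0, 1]^dim; distances use the L_p norm.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # L_p distance between the query and this point.
        total = sum(abs(a - b) ** p for a, b in zip(query, point))
        dists.append(total ** (1.0 / p))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the contrast for increasing dimensionality.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

A small contrast means that a full-space distance carries little information for separating clusters, which is one reason subspace clustering methods weight or select dimensions per cluster instead.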
Keywords/Search Tags: High dimensional data, Subspace clustering, Cluster validity, Information security