
Geometric methods for mining large and possibly private datasets

Posted on: 2007-02-15
Degree: Ph.D.
Type: Thesis
University: Georgia Institute of Technology
Candidate: Chen, Keke
Full Text: PDF
GTID: 2458390005988805
Subject: Computer Science
Abstract/Summary:
With the wide deployment of data-intensive Internet applications and continued advances in sensing technology and biotechnology, large multidimensional datasets, possibly containing privacy-sensitive information, have been emerging. Mining such datasets has become increasingly common in business integration, large-scale scientific data analysis, and national security. This thesis addresses three important problems in mining large and possibly private datasets. The first problem is to prove the hypothesis that interactive visualization techniques can be used to develop an effective yet flexible framework for clustering very large datasets, especially those with irregularly shaped clusters. The second problem is to prove the hypothesis that there is an effective method for determining the critical clustering structure of categorical data, i.e., finding the best number of clusters K in categorical data. The third problem is to prove the hypothesis that we can develop multidimensional data perturbation techniques that provide a high privacy guarantee with little sacrifice in the accuracy of some data mining models. The research aims to explore the geometric properties of the multidimensional datasets used in statistical learning and data mining, and to provide novel techniques and frameworks for mining very large datasets while protecting the desired data privacy.
Some of the most challenging problems in numerical data clustering include identifying irregularly shaped clusters, incorporating domain knowledge into clustering, and labeling clusters for large amounts of data on disk. These problems are aggravated when the dataset is huge and the clustering phase is performed on a sampled subset of the data. Existing automatic approaches are not effective in dealing with the first two problems, while existing visualization approaches do not address the challenges of clustering large datasets.
The first main contribution of this research is the development of iVIBRATE, an interactive visualization-based approach for clustering very large datasets. The iVIBRATE approach addresses these problems with a visualization-based three-phase framework: "Sampling - Visual Cluster Rendering - Visualization-based Disk Labeling". The distinct characteristics of the iVIBRATE approach are twofold. (1) We design and develop the VISTA visual cluster rendering subsystem, which brings humans into the large-scale iterative clustering process through interactive visualization. VISTA can effectively resolve most visual cluster overlaps through interactive visual cluster rendering. (2) We also develop an Adaptive ClusterMap Labeling subsystem, which offers a visualization-guided disk-labeling solution that is effective in dealing with outliers, irregular clusters, and cluster boundary extension for large datasets.
Many categorical data clustering algorithms have been proposed. However, the important problem of identifying the best number of clusters K has not yet been well addressed. The second main contribution is the development of the "Best K Plot" (BKPlot) method for determining the critical clustering structures in multidimensional categorical data. The BKPlot method addresses two challenges in clustering categorical data: how to determine the number of clusters (the best K) and how to identify the existence of significant clustering structures. The method makes a few unique contributions. (1) The basic method is based on the entropy difference between optimal clustering results with varying K. BKPlot can suggest a few candidates for the best K, each of which identifies a different layer of critical clustering structure. (2) We also develop the sample BKPlot theory for characterizing the critical clustering structures in very large categorical datasets. (3) The basic BKPlot method and the sample BKPlot method are extended to characterize the feature of no-cluster datasets, which is then used to identify the exist...
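The entropy-difference idea behind the basic BKPlot method can be illustrated with a small sketch. This is not the thesis's implementation: the function names (`column_entropy`, `expected_entropy`, `entropy_differences`), the use of size-weighted per-attribute entropy as the clustering quality measure, and the toy data are all assumptions made for illustration, and the cluster labelings are supplied by hand rather than produced by an optimal clustering algorithm.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def expected_entropy(rows, labels):
    """Size-weighted sum of per-attribute entropies over all clusters.

    Assumption of this sketch: attributes are treated independently.
    """
    n = len(rows)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    total = 0.0
    for members in clusters.values():
        cluster_h = sum(column_entropy(col) for col in zip(*members))
        total += len(members) / n * cluster_h
    return total


def entropy_differences(rows, labelings):
    """Map K to H(K) - H(K + 1) for every consecutive pair of K values.

    `labelings` maps K to a list of cluster labels (hypothetical input;
    in practice each labeling would come from a clustering algorithm).
    """
    h = {k: expected_entropy(rows, labs) for k, labs in labelings.items()}
    return {k: h[k] - h[k + 1] for k in sorted(h) if k + 1 in h}


# Toy dataset: two obvious groups of categorical records.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
labelings = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
}
diffs = entropy_differences(rows, labelings)
print(diffs)
```

In this toy case the entropy drops by 2 bits when moving from one cluster to two and does not change when moving from two to three, which is the kind of sharp change in the entropy difference that would point to K = 2 as a candidate.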
Keywords/Search Tags:Data, Large, Method, Mining, Clustering, Visual cluster rendering, Prove the hypothesis, Possibly