Font Size: a A A

Scalable model-based clustering algorithms for large databases and their applications

Posted on:2003-10-08Degree:Ph.DType:Thesis
University:Chinese University of Hong Kong (People's Republic of China)Candidate:Jin, HuidongFull Text:PDF
GTID:2468390011987996Subject:Computer Science
Abstract/Summary:
With the unabated growth of data amassed from business, scientific and engineering disciplines, cluster analysis and other data mining functionalities, play a more and more important role. They can reveal previously unknown and potentially useful patterns and relations in large databases. One of the most significant challenges in data mining is scalability—effectively handling large databases with linear computational complexity and limited main memory.; This thesis addresses the scalability problem of the mixture model-based clustering algorithms for databases with large number of data items. It proposes a scalable model-based clustering framework. The basic idea is first to partition a large data set into subclusters roughly, and then to generate clusters from summary statistics of these subclusters effectively. The procedure may be manipulated by a controller. The framework consists of three modules: data summarization, in-memory clustering, and optional controller.; The data summarization module sums up data sets by only storing appropriate summary statistics. Besides the BIRCH'S data summarization procedure, there exist many approaches. Our adaptive grid-based data summarization procedure simply partitions data space and sum up the data items within the same cells into summary statistics. Based on the principle of Self-Organizing Map (SOM), we establish an expanding SOM and an integrated SOM for data summarization and projection. The two SOMs can generate better mappings than the traditional SOM in terms of both the quantization and the topological errors. It is also substantiated by the experiments where they can generate most accurate solutions of the travelling salesman problem so far in the neural network literature.; The in-memory clustering module generates mixture models from summary statistics of subclusters. We establish two model-based clustering algorithms based on the general Expectation-Maximization (EM) algorithm. If attributes are statistically independent, we use a clustering feature to represent a subcluster, and employ EMACF to generate clusters. Otherwise, our EMADS handles data summaries where the correlations are embedded. We prove the convergence of both algorithms. Combining with the grid-based data summarization procedure, we have two scalable clustering systems: the gEMACF and gEMADS algorithms. Similarly, the bEMACF and bEMADS algorithms work on the subclusters generated by the BIRCH's data summarization procedure. They can run one or two orders of magnitude faster than the classical model-based clustering algorithm and generate results with no or little loss of accuracy. Their effectiveness and efficiency are confirmed by experiments on both synthetic and real world data.; The last optional module provides a controller to our clustering system. We introduce genetic algorithms to guide the model-based clustering techniques and establish the GAXEM algorithm. It determines the optimal number of clusters automatically. Some preliminary experimental results substantiate that the GAXEM algorithm performs well on both synthetic and real world data sets.; Besides the clustering application of our model-based techniques, we also briefly discuss two other applications.
Keywords/Search Tags:Data, Clustering, Model-based, Scalable, Summary statistics, SOM
Related items