Multidimensional indexing and management for large-scale databases

Posted on: 2002-03-20
Degree: Ph.D.
Type: Dissertation
University: State University of New York at Buffalo
Candidate: Yu, Dantong
Full Text: PDF
GTID: 1468390011490530
Subject: Computer Science
Abstract/Summary:
Advances in technology across many fields are generating large volumes of scientific data that must be archived. A common property of these datasets is that they consist of high-dimensional vectors. Their high dimensionality and tremendous size create very challenging problems in classification, management, analysis, and retrieval. In these applications, the data points must be classified, and similar data points should be assigned to the same group. However, the design of efficient and effective classification (clustering) algorithms for high-dimensional datasets remains largely an open problem. Although much research has been done in this area, most existing approaches suffer from the so-called curse of dimensionality. In this dissertation, we introduce WaveCluster+, an efficient algorithm for detecting clusters in very large, high-dimensional datasets. Representing the dataset with a hash-based data structure overcomes the curse of dimensionality. We offer a detailed technique for cumulatively applying wavelet transforms to the hashed feature space, which proves to be very efficient and effective. Clustering organizes data according to similarity, exposing the distribution of a dataset. This facilitates the design of efficient index structures for data management systems. We also introduce the ClusterTree, a new indexing approach for representing clusters generated by any existing clustering approach. A ClusterTree is a hierarchy of clusters and subclusters that incorporates the cluster representation into the index structure to achieve effective and efficient retrieval. Because of the huge amount of data, the datasets are stored on disk arrays (RAID systems), and the management of data points at the disk level is a crucial factor for efficiency. We consider the problem of exploiting the parallelism of disk arrays to reduce the cost of I/O and speed up retrieval.
Experiments were set up to evaluate the clustering, indexing, and retrieval algorithms proposed in this research.
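
The abstract does not specify the hash scheme or the wavelet used, so the following is only a minimal sketch of the general idea, not the dissertation's implementation. It assumes a uniform grid over the unit cube, a Python dictionary as the hash-based structure (only occupied cells are stored), a single Haar averaging step applied along one dimension at a time, and face-adjacent connectivity between dense cells. All function names, the grid resolution, and the density threshold are hypothetical.

```python
from collections import defaultdict

def quantize(points, cells_per_dim, lo=0.0, hi=1.0):
    """Hash each point to its grid cell; only occupied cells are stored."""
    grid = defaultdict(float)
    width = (hi - lo) / cells_per_dim
    for p in points:
        cell = tuple(min(int((x - lo) / width), cells_per_dim - 1) for x in p)
        grid[cell] += 1.0
    return grid

def haar_lowpass(grid, dim):
    """One Haar averaging step along `dim`: average cell pairs (2k, 2k+1)."""
    out = defaultdict(float)
    for cell, value in grid.items():
        merged = cell[:dim] + (cell[dim] // 2,) + cell[dim + 1:]
        out[merged] += value / 2.0
    return out

def smooth(grid, n_dims):
    """Apply the averaging step cumulatively, one dimension after another."""
    for d in range(n_dims):
        grid = haar_lowpass(grid, d)
    return grid

def dense_cells(grid, threshold):
    """Keep cells whose smoothed value reaches the (assumed) threshold."""
    return {c for c, v in grid.items() if v >= threshold}

def connected_clusters(cells):
    """Group cells that differ by 1 in exactly one coordinate."""
    cells = set(cells)
    clusters = []
    while cells:
        start = cells.pop()
        stack, component = [start], {start}
        while stack:
            c = stack.pop()
            for d in range(len(c)):
                for step in (-1, 1):
                    n = c[:d] + (c[d] + step,) + c[d + 1:]
                    if n in cells:
                        cells.remove(n)
                        component.add(n)
                        stack.append(n)
        clusters.append(component)
    return clusters

# Two well-separated groups of 2-D points (made-up illustration data).
points = [(0.05, 0.05), (0.1, 0.1), (0.12, 0.08), (0.2, 0.2),
          (0.9, 0.9), (0.95, 0.92), (0.88, 0.97), (0.8, 0.85)]
grid = quantize(points, cells_per_dim=8)
smoothed = smooth(grid, n_dims=2)
clusters = connected_clusters(dense_cells(smoothed, threshold=0.5))
```

In this sketch the cost of each pass depends on the number of occupied cells rather than on the full grid size, which grows exponentially with the number of dimensions. That is the motivation for a hash-based representation in this setting.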
Keywords/Search Tags: Data, Indexing, Management, Clustering