Multidimensional indexing and management for large-scale databases

Posted on:2002-03-20

Degree:Ph.D

Type:Dissertation

University:State University of New York at Buffalo

Candidate:Yu, Dantong

GTID:1468390011490530

Subject:Computer Science

Abstract/Summary:

With the advancement of new technology in many fields, large volumes of scientific data are being generated and archived. A common property of these datasets is that they consist of high-dimensional vectors. The high dimensionality and tremendous size of such datasets create very challenging problems in classification, management, analysis, and retrieval. In these applications, the data points must be classified, and similar data points should be assigned to the same group. However, the design of efficient and effective classification (clustering) algorithms for high-dimensional datasets remains largely an open problem. Although much research has been done in this area, most existing approaches suffer from the so-called curse of dimensionality. In this dissertation, we introduce WaveCluster+, an efficient algorithm for detecting clusters in very large, high-dimensional datasets. By representing the dataset with a hash-based data structure, the algorithm overcomes the curse of dimensionality. We offer a detailed technique to cumulatively apply wavelet transforms on the hashed feature space, which proves to be very efficient and effective. Clustering organizes data according to similarity, exposing the distribution of a dataset, which facilitates the design of efficient index structures for data management systems. We also introduce the ClusterTree, a new indexing approach for representing clusters generated by any existing clustering approach. A ClusterTree is a hierarchy of clusters and subclusters that incorporates the cluster representation into the index structure to achieve effective and efficient retrieval. Because of the huge amount of data, the datasets are stored on disk arrays (RAID systems), and the management of data points at the disk level is a crucial factor for efficiency. We consider the problem of exploiting the parallelism of disk arrays to reduce I/O cost and speed up retrieval. Experiments were set up to evaluate the clustering, indexing, and retrieval algorithms proposed in this research.
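
The hash-based grid representation and the per-dimension wavelet step mentioned in the abstract can be illustrated with a small sketch. This is not the dissertation's WaveCluster+ implementation; it is a minimal illustration under stated assumptions: points are quantized into grid cells, only occupied cells are kept in a hash map, one level of an unnormalized Haar averaging filter is applied along each axis in turn, and dense cells that share a face are grouped into clusters. The function names (`quantize`, `haar_lowpass`, `cluster`) and the parameters `cell_size` and `threshold` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def quantize(points, cell_size):
    """Hash each point into a grid cell; only occupied cells are stored."""
    grid = defaultdict(float)
    for p in points:
        cell = tuple(int(x // cell_size) for x in p)
        grid[cell] += 1.0
    return dict(grid)


def haar_lowpass(grid, axis):
    """One level of unnormalized Haar averaging along one axis.

    Cells 2k and 2k+1 along `axis` are merged into coarse cell k,
    and the coarse value is the average of the pair (missing cells count as 0).
    """
    out = defaultdict(float)
    for cell, count in grid.items():
        coarse = cell[:axis] + (cell[axis] // 2,) + cell[axis + 1:]
        out[coarse] += count / 2.0
    return dict(out)


def cluster(points, cell_size, threshold):
    """Assumed pipeline: quantize, filter each axis, threshold, connect cells.

    Returns one label per input point; -1 marks points in sparse cells.
    """
    dims = len(points[0])
    grid = quantize(points, cell_size)
    for axis in range(dims):  # apply the filter one dimension at a time
        grid = haar_lowpass(grid, axis)

    dense = {c for c, v in grid.items() if v >= threshold}

    # Label connected components among dense cells (face-adjacent neighbours).
    labels = {}
    next_label = 0
    for start in dense:
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            cell = stack.pop()
            for axis in range(dims):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in dense and nb not in labels:
                        labels[nb] = next_label
                        stack.append(nb)
        next_label += 1

    # Map each point to the label of its coarse cell.
    result = []
    for p in points:
        coarse = tuple(int(x // cell_size) // 2 for x in p)
        result.append(labels.get(coarse, -1))
    return result


points = [
    (0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (0.3, 0.4),  # first group
    (5.1, 5.2), (5.3, 5.1), (5.2, 5.4), (5.4, 5.3),  # second group
    (9.9, 0.1),                                      # isolated point
]
point_labels = cluster(points, cell_size=0.5, threshold=0.5)
```

Because the hash map holds only occupied cells, memory grows with the number of non-empty cells rather than with the full grid volume, which is the property that makes a grid-based approach plausible in many dimensions. In this sketch the two groups receive different labels and the isolated point is labelled -1.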

Keywords/Search Tags:

Data, Indexing, Management, Clustering

Related items

1	Web crawler indexing: An approach by clustering
2	Research And Application Of Clustering Algorithm In Image Indexing
3	Research And Realization Of Network Consensus Monitor System Based On The Incremental Text Mining
4	Research On Document Clustering Technology Based On Latent Semantic Indexing
5	Research On Strategies Of Indexing In Dataspace
6	External Data Access and Indexing in a Scalable Big Data Management System
7	Research Of Indexing Technology In Medical Image Retrieval Platform
8	On indexing large databases for advanced data models
9	Clustering and indexing methods for high dimensional data and moving objects
10	Research On Hierarchical Indexing Technique For Large Image Database