Font Size: a A A

Parallel aggregation on multi-dimensional scientific datasets

Posted on:2002-11-10Degree:Ph.DType:Dissertation
University:University of Maryland College ParkCandidate:Chang, ChialinFull Text:PDF
GTID:1468390011990616Subject:Computer Science
Abstract/Summary:
The availability of massive scientific datasets and low-cost high-performance commodity computer systems has made data analysis and exploration an increasingly important process in many domains of scientific research. In this dissertation, we examine algorithms for the efficient parallel aggregation of data involving application-specific aggregation functions, a type of processing that is commonly used for analysis and exploration of multi-dimensional dataset. We develop a framework called the Active Data Repository (ADR) with a runtime system that can efficiently support the parallel execution of application-specific aggregation functions over disk-based datasets on a distributed-memory machine with a disk farm. We evaluate the performance of several parallel aggregation algorithms with both synthetic data and data derived from real scientific applications, using the ADR runtime system.; Three of the parallel aggregation algorithms that we study are based on techniques proposed for the parallel execution of reduction operation in scientific applications and traditional approaches proposed for the parallel execution of group-by queries in database systems. Our experiments show that these algorithms do not adapt, and therefore their relative performance depends on the characteristics of the processing and the machine configurations.; We propose two new parallel aggregation algorithms that partition workload based on the cost of the I/O, communication and computation operations such that minimum communication overhead is achieved while maintaining good load balance. The hypergraph-based algorithm models parallel aggregation as a weighted hypergraph and partitions the workload based on a multi-way min-cut for the hypergraph. The cost-based algorithm assigns workload to processors, using a mixture of three workload assignment strategies, such that the estimated cost of the parallel aggregation is minimized. Our experiments show that the proposed algorithms perform well under all tested scenarios, and often outperform all algorithms based on the traditional approaches.
Keywords/Search Tags:Parallel aggregation, Scientific, Data, Algorithms
Related items