Parallel aggregation on multi-dimensional scientific datasets

Posted on:2002-11-10

Degree:Ph.D

Type:Dissertation

University:University of Maryland College Park

Candidate:Chang, Chialin

Full Text:PDF

GTID:1468390011990616

Subject:Computer Science

Abstract/Summary:

The availability of massive scientific datasets and low-cost high-performance commodity computer systems has made data analysis and exploration an increasingly important process in many domains of scientific research. In this dissertation, we examine algorithms for the efficient parallel aggregation of data involving application-specific aggregation functions, a type of processing that is commonly used for analysis and exploration of multi-dimensional dataset. We develop a framework called the Active Data Repository (ADR) with a runtime system that can efficiently support the parallel execution of application-specific aggregation functions over disk-based datasets on a distributed-memory machine with a disk farm. We evaluate the performance of several parallel aggregation algorithms with both synthetic data and data derived from real scientific applications, using the ADR runtime system.; Three of the parallel aggregation algorithms that we study are based on techniques proposed for the parallel execution of reduction operation in scientific applications and traditional approaches proposed for the parallel execution of group-by queries in database systems. Our experiments show that these algorithms do not adapt, and therefore their relative performance depends on the characteristics of the processing and the machine configurations.; We propose two new parallel aggregation algorithms that partition workload based on the cost of the I/O, communication and computation operations such that minimum communication overhead is achieved while maintaining good load balance. The hypergraph-based algorithm models parallel aggregation as a weighted hypergraph and partitions the workload based on a multi-way min-cut for the hypergraph. The cost-based algorithm assigns workload to processors, using a mixture of three workload assignment strategies, such that the estimated cost of the parallel aggregation is minimized. Our experiments show that the proposed algorithms perform well under all tested scenarios, and often outperform all algorithms based on the traditional approaches.

Keywords/Search Tags:

Parallel aggregation, Scientific, Data, Algorithms

Related items

1	Parallel particle tracking algorithms and software for applications in scientific computing
2	Massive Data Aggregation And Parallel Implementation With Complex Constraints
3	Revisiting Aggregation Techniques for Data Intensive Applications
4	Research On Sampling-based Data Aggregation Algorithms In Wireless Sensor Networks
5	Study On Aggregation Patterns Of Foreign Scientific And Technical Reports Of Helicopter Based On FRBR
6	Research On Data Mining Algorithms For Scientific Computational Time Varying Data Sets
7	Study Of Data Stream Statistical Algorithms
8	Research On Wireless Network Data Acquisition And Data Aggregation Algorithms
9	Instrumentation database approach to the analysis of large parallel and object-oriented scientific applications
10	Research On Broadcast And Data Aggregation Algorithms In Wireless Ad Hoc Networks