Supporting data analysis in large scale scientific databases

Posted on:2012-04-20

Degree:Ph.D

Type:Dissertation

University:The Johns Hopkins University

Candidate:Wang, Xiaodan

Full Text:PDF

GTID:1458390011452414

Subject:Computer Science

Abstract/Summary:

Continued improvements in physical instruments and data pipelines has lead to an exponential growth in data size in the Sciences. In turn, the Scientific community distributes data geographically at a global scale to facilitate the accumulation of data at multiple sources and relies on application-driven optimizations to manage repositories at a petabyte scale. Moreover, workloads that comb through vast amounts of data are gaining importance in the Sciences. These workloads consist of "needle in a haystack" queries that are long running and data intensive (i.e. non-indexed scans of multi-terabyte tables) so that query throughput limits performance. Queries also join data from geographically distributed sources so that transmitting data produces tremendous strains on the network. Thus, query scheduling needs to be re-examined to overcome scalability barriers and enable a large community of users to explore the resulting, massive amounts of Scientific data.;Toward this goal, this dissertation addresses two crucial bottlenecks that limit large scale data analysis: contention for network and I/O resources from simultaneous analysis by multiple users. Specifically, data analysis at a petabyte scale takes hours or days so that computing resources must be allotted efficiently to prevent a single user from exhausting all available capacity. We first describe algorithms that incorporate network structure in the scheduling of distributed join queries for SkyQuery, a globally-distributed federation of Astronomy databases. The resulting schedules exploit excess network capacity and minimize the utilization of network resources over multiple queries. Next, we put forth a data-driven, batch processing paradigm using the Turbulence Database Cluster as a motivating application. Batch scheduling eliminates redundant I/O and improves throughput in highly contentious environments by exploiting partial overlap in the data accessed by incoming queries. These scheduling optimizations allow Scientific databases to improve the scale of exploration and support more concurrent users.;Our techniques for eliminating redundant I/O among multiple users can also be adapted to large scale data processing systems such as Hadoop, Cassdandra, and BigTable. Instrumenting our algorithms in Astronomy and Turbulence databases reduce network and I/O costs by several fold. We observed similar benefits on the Hadoop/Pig platform for workloads at Yahoo.

Keywords/Search Tags:

Data, Large scale, I/O, Scientific, Network

Related items

1	Towards data-driven large-scale scientific visualization and exploration
2	Optimization and management of large-scale scientific workflows in heterogeneous network environments: From theory to practice
3	Based On The Qr Codes Of Large Scientific Instruments Sharing Remote Monitoring System
4	Perfomance analysis and optimization of large-scale scientific applications
5	High-performance visualization algorithms for large-scale scientific data
6	Advanced Visualization Techniques and Data Representations for Large Scale Scientific Data
7	Research And Implementation Of The Virtual Remote Sharing Mechanism Of Large-Scale Scientific Instrument
8	Enabling dynamic interactions in large scale applications and scientific workflows using semantically specialized shared dataspaces
9	Data mining techniques to enable large-scale exploratory analysis of heterogeneous scientific data
10	Large-scale Scientific Data Mining Density Clustering Algorithm