Font Size: a A A

Supporting data analysis in large scale scientific databases

Posted on:2012-04-20Degree:Ph.DType:Dissertation
University:The Johns Hopkins UniversityCandidate:Wang, XiaodanFull Text:PDF
GTID:1458390011452414Subject:Computer Science
Abstract/Summary:
Continued improvements in physical instruments and data pipelines has lead to an exponential growth in data size in the Sciences. In turn, the Scientific community distributes data geographically at a global scale to facilitate the accumulation of data at multiple sources and relies on application-driven optimizations to manage repositories at a petabyte scale. Moreover, workloads that comb through vast amounts of data are gaining importance in the Sciences. These workloads consist of "needle in a haystack" queries that are long running and data intensive (i.e. non-indexed scans of multi-terabyte tables) so that query throughput limits performance. Queries also join data from geographically distributed sources so that transmitting data produces tremendous strains on the network. Thus, query scheduling needs to be re-examined to overcome scalability barriers and enable a large community of users to explore the resulting, massive amounts of Scientific data.;Toward this goal, this dissertation addresses two crucial bottlenecks that limit large scale data analysis: contention for network and I/O resources from simultaneous analysis by multiple users. Specifically, data analysis at a petabyte scale takes hours or days so that computing resources must be allotted efficiently to prevent a single user from exhausting all available capacity. We first describe algorithms that incorporate network structure in the scheduling of distributed join queries for SkyQuery, a globally-distributed federation of Astronomy databases. The resulting schedules exploit excess network capacity and minimize the utilization of network resources over multiple queries. Next, we put forth a data-driven, batch processing paradigm using the Turbulence Database Cluster as a motivating application. Batch scheduling eliminates redundant I/O and improves throughput in highly contentious environments by exploiting partial overlap in the data accessed by incoming queries. These scheduling optimizations allow Scientific databases to improve the scale of exploration and support more concurrent users.;Our techniques for eliminating redundant I/O among multiple users can also be adapted to large scale data processing systems such as Hadoop, Cassdandra, and BigTable. Instrumenting our algorithms in Astronomy and Turbulence databases reduce network and I/O costs by several fold. We observed similar benefits on the Hadoop/Pig platform for workloads at Yahoo.
Keywords/Search Tags:Data, Large scale, I/O, Scientific, Network
Related items