Data processing and workflow scheduling in cluster computing systems

Posted on: 2009-12-27
Degree: Ph.D
Type: Dissertation
University: The University of Wisconsin - Madison
Candidate: Shankar, Srinath
Full Text: PDF
GTID: 1448390002490685
Subject: Computer Science
Abstract/Summary:
The data explosion in the scientific and commercial domains has led to a renewed interest in the use of cluster computing for parallel data processing. Broadly, cluster computing systems can be split into three classes: batch computing systems, MapReduce systems, and parallel database management systems.

All of these systems have their advantages and disadvantages. Condor, a popular batch computing system, can be used in a wide variety of domains due to its ability to execute user-specified workflows. However, it lacks support for distributed data management and cannot parallelize data-intensive applications. MapReduce and parallel database systems, on the other hand, can automatically parallelize user applications. Both have some support for distributed data management, but are more specialized in the kinds of applications that can be run.

In this document, we demonstrate the benefits of distributed data management in Condor and outline a data-aware workflow scheduling mechanism that is suited to data-intensive applications. These features are also implemented in Clustera, a new cluster management system. Central to the architecture of Clustera is an application server backed by a database that stores the operational information that is used and produced in the cluster.

Despite their differences, all three classes of cluster computing systems execute workflows consisting of tasks with data dependencies over a cluster of machines. In Clustera, we bridge the gap between these systems by building a logical data layer and an abstract workflow layer over the core functionality of distributed data management and workflow execution. Thus, while Clustera is flexible enough to execute workflows from a wide variety of domains, it can also automatically parallelize and execute applications expressed in high-level terms, such as a SQL query over relational tables or a MapReduce workflow.
Keywords/Search Tags:Data, Cluster, Workflow, Systems, Applications, Execute