Data processing and workflow scheduling in cluster computing systems

Posted on:2009-12-27

Degree:Ph.D

Type:Dissertation

University:The University of Wisconsin - Madison

Candidate:Shankar, Srinath

GTID:1448390002490685

Subject:Computer Science

Abstract/Summary:

The data explosion in the scientific and commercial domains has led to a renewed interest in the use of cluster computing for parallel data processing. Broadly, cluster computing systems can be split into three classes: batch computing systems, MapReduce, and parallel database management systems.

All of these systems have their advantages and disadvantages. Condor, a popular batch computing system, can be used in a wide variety of domains due to its ability to execute user-specified workflows. However, it lacks support for distributed data management and the ability to parallelize data-intensive applications. On the other hand, MapReduce and parallel database systems have the ability to automatically parallelize user applications. Both have some support for distributed data management, but are more specialized in the kinds of applications that can be run.

In this document, we demonstrate the benefits of distributed data management in Condor and outline a data-aware workflow scheduling mechanism that is suited to data-intensive applications. These features are also implemented in Clustera, a new cluster management system. Central to the architecture of Clustera is an application server backed by a database that stores the operational information that is used and produced in the cluster.

Despite their differences, all three classes of cluster computing systems execute workflows consisting of tasks with data dependencies over a cluster of machines. In Clustera, we bridge the gap between these systems by building a logical data layer and an abstract workflow layer over the core functionality of distributed data management and workflow execution. Thus, while Clustera is flexible enough to execute workflows from a wide variety of domains, it also possesses the ability to automatically parallelize and execute applications expressed in high-level terms, such as a SQL query over relational tables, or a MapReduce workflow.
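
To make the idea of data-aware scheduling of a workflow with data dependencies concrete, here is a minimal Python sketch. It is not the mechanism used in Condor or Clustera, which the abstract does not specify. It assumes a simple greedy rule: run each task on the node that already holds the most bytes of its input, in dependency order. The function name `schedule_workflow`, the task and file names, and the sizes are all hypothetical, and load balancing is ignored.

```python
from collections import deque


def schedule_workflow(tasks, nodes, file_locations, file_sizes):
    """Assign each task to a node, greedily favoring data locality.

    tasks: dict task -> {"inputs": [file, ...], "outputs": [file, ...]}
    nodes: list of node names (earlier nodes win ties)
    file_locations: dict file -> node that holds it before the workflow runs
    file_sizes: dict file -> size in bytes (inputs and outputs)
    Returns dict task -> node. Raises ValueError on cyclic dependencies.
    """
    # Map every produced file to the task that writes it.
    producer = {f: t for t, spec in tasks.items() for f in spec["outputs"]}

    # A task depends on the producers of its input files.
    deps = {
        t: {producer[f] for f in spec["inputs"] if f in producer}
        for t, spec in tasks.items()
    }
    children = {t: [] for t in tasks}
    for t, parents in deps.items():
        for p in parents:
            children[p].append(t)
    pending = {t: len(parents) for t, parents in deps.items()}

    # Tasks with no unmet dependencies can be placed right away.
    ready = deque(sorted(t for t, n in pending.items() if n == 0))
    location = dict(file_locations)  # updated as outputs are produced
    placement = {}

    while ready:
        task = ready.popleft()
        # Count how many input bytes each node already holds.
        local_bytes = {n: 0 for n in nodes}
        for f in tasks[task]["inputs"]:
            local_bytes[location[f]] += file_sizes[f]
        # max() keeps the first node on ties, so the choice is deterministic.
        chosen = max(nodes, key=local_bytes.get)
        placement[task] = chosen
        # Outputs are assumed to stay on the node where the task ran.
        for f in tasks[task]["outputs"]:
            location[f] = chosen
        for child in children[task]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(placement) != len(tasks):
        raise ValueError("workflow contains a dependency cycle")
    return placement


# Hypothetical three-task workflow on a two-node cluster.
example_tasks = {
    "extract": {"inputs": ["raw.dat"], "outputs": ["clean.dat"]},
    "index": {"inputs": ["ref.dat"], "outputs": ["ref.idx"]},
    "join": {"inputs": ["clean.dat", "ref.idx"], "outputs": ["result.dat"]},
}
example_sizes = {
    "raw.dat": 500, "ref.dat": 100,
    "clean.dat": 400, "ref.idx": 20, "result.dat": 50,
}
placement = schedule_workflow(
    example_tasks,
    ["node1", "node2"],
    {"raw.dat": "node1", "ref.dat": "node2"},
    example_sizes,
)
print(placement)
```

In this example, "join" lands on node1 because the 400-byte intermediate file it reads is already there, so only the 20-byte index would need to move. A real scheduler would also weigh node load and availability against locality.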

Keywords/Search Tags:

Data, Cluster, Workflow, Systems, Applications, Execute

Related items

1	Toward practical multi-workflow scheduling in cluster and grid environments
2	An integration architecture for large scale Web applications involving workflow, data exchange, and knowledge bases
3	Workflow Management Technology For Enterprises
4	Research On Service-oriented Workflow System Application
5	Xml-based Database Workflow Systems Research And Applications
6	Research On Workflow Management Technology
7	Cluster Support Workflow Management System Cluster-synchroflow, Design And Implementation
8	Research And Implementation Of High Reliability Data Integration System Based On Cluster
9	Research On Pharmacy Management System Application Based On Workflow
10	Research And Implementation Of Scientific Big Data Application Execution Optimization Mechanism In Multiple Data Center Environments