Computing Sliding Window Aggregates over Data Streams in a Scientific Workflow System

Posted on:2011-02-25

Degree:M.S.

Type:Thesis

University:University of California, Davis

Candidate:Gulati, Supriya Sudhir

GTID:2448390002960545

Subject:Computer Science

Abstract/Summary:

Scientific workflow management systems let scientists create and automate tasks for scientific data management, analysis, simulation, and visualization. Kepler is a free, open-source scientific workflow management system built on Ptolemy II. It supports a design in which components called actors communicate with one another via data pipes and are scheduled for execution under different models of computation.

The objective of this thesis is to extend Kepler's capabilities for processing continuous data streams. The current approach to aggregation in scientific workflows, and in Kepler, is s-aggregation, i.e., aggregation based on sequence order only. As we shall see later in more detail, t-aggregation, i.e., aggregation based on timestamps, can lead to more user-friendly workflows than s-aggregation. One example is a Growing Degree Day workflow that processes sensor data streams containing hourly temperatures ordered by timestamp. Its aim is to compute daily growing degree day units by counting tuples in every hour of the incoming stream. This method is not reliable because the rate at which data streams arrive is nondeterministic, so one cannot, in general, predict the number of tokens present in each window. Another current limitation of Kepler concerns aggregates over sliding windows: the user has to rerun the workflow for every window. Thus Kepler is currently not very flexible, reliable, or user-friendly for performing t-aggregations over sliding windows.

In the context of scientific workflows, we have developed a Kepler actor that can be used in several workflows to perform t-aggregations. Its inputs are a data stream, which is a sequence of tuples each containing a data value and a timestamp, and a window stream, which is a sequence of tuples each containing a start timestamp and an end timestamp. The actor includes the following aggregates: count, sum, average, maximum, minimum, and array (a form of grouping). These are all standard aggregates, except perhaps the array function, which groups all data values that fall within a t-window into a single array (conceptually, a list). We describe the algorithm and an example demonstrating how the actor works. We also present several scientific case studies demonstrating the utility of the actor in several applications.
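
To make the idea of t-aggregation concrete, the following is a minimal Python sketch and not the thesis's Kepler actor (which would be written against Kepler/Ptolemy II in Java). The function name `t_aggregate`, the `(timestamp, value)` tuple order, the use of plain numbers as timestamps, and the half-open window convention `[start, end)` are all assumptions made for illustration; the abstract does not specify them. The sketch also assumes a finite, already-collected stream rather than a continuous one.

```python
from typing import Any, Dict, Iterable, List, Tuple

DataTuple = Tuple[float, float]  # (timestamp, value); order is an assumption
Window = Tuple[float, float]     # (start timestamp, end timestamp)


def t_aggregate(data: Iterable[DataTuple],
                windows: Iterable[Window]) -> List[Dict[str, Any]]:
    """Compute count, sum, average, maximum, minimum, and array per window.

    A value belongs to a window when start <= timestamp < end
    (half-open interval; an assumption of this sketch).
    """
    data = list(data)  # materialize so every window can scan the data
    results = []
    for start, end in windows:
        # Membership is decided by timestamp, not by position in the stream.
        values = [v for ts, v in data if start <= ts < end]
        n = len(values)
        total = sum(values)
        results.append({
            "window": (start, end),
            "count": n,
            "sum": total,
            "average": total / n if n else None,  # undefined for empty windows
            "maximum": max(values) if values else None,
            "minimum": min(values) if values else None,
            "array": values,  # grouping of all values that fall in the window
        })
    return results


# Hourly temperatures with the reading for hour 2 missing.
readings = [(0, 10.0), (1, 12.0), (3, 11.0), (4, 15.0)]
# Sliding windows of width 3 hours, advancing by 1 hour.
sliding = [(0, 3), (1, 4), (2, 5)]

for row in t_aggregate(readings, sliding):
    print(row)
```

Because membership depends on timestamps, the missing reading at hour 2 simply lowers the count of the affected windows instead of shifting later tuples into the wrong window, which is what a fixed tuple count per window would do. All sliding windows are also evaluated in a single pass over the window stream rather than by rerunning the computation once per window.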

Keywords/Search Tags:

Scientific, Data, Workflow, Window, Aggregates, Sliding, Over, Actor

Related items

1	Optimal Data Streams Clustering Algorithm Based On N-δ Sliding Window Model
2	Research And Implementation Of Scientific Big Data Application Execution Optimization Mechanism In Multiple Data Center Environments
3	Estimating Sliding Window-Based Aggregation Queries Over Probabilistic Data Streams
4	Research On Workflow Model For Multi-domain Scientific Data Management And Its Provenance Mechanism
5	Research On Uncertain Data Stream Clustering Method Based On Variable Sliding Window
6	Research On Data Placement Strategy For Scientific Workflow In Cloud
7	Research On Frequent Patterns Mining Algorithm Based Sliding Window In Data Streams
8	Research On Scientific Big Data Query Processing Technology Based On Workflow
9	Research On Scientific Workflow Data Layout Strategy For Cloud Environment
10	The Processing Strategy For Data Streams Based On Sliding Window In Simulation Platform