
Computing Sliding Window Aggregates over Data Streams in a Scientific Workflow System

Posted on: 2011-02-25
Degree: M.S.
Type: Thesis
University: University of California, Davis
Candidate: Gulati, Supriya Sudhir
Full Text: PDF
GTID: 2448390002960545
Subject: Computer Science
Abstract/Summary:
Scientific workflow management systems help scientists create and automate tasks for scientific data management, analysis, simulation, and visualization. Kepler, a scientific workflow management system, is free, open-source software that builds upon Ptolemy II. It supports a design in which components called actors communicate with one another via data pipes and schedule execution under different models of computation.

The objective of this thesis is to extend Kepler's capabilities for processing continuous data streams. The current approach to aggregation in scientific workflows, and in Kepler, is s-aggregation, i.e., aggregation based on sequence order only. As we shall see later in more detail, t-aggregation, i.e., aggregation based on timestamps, can lead to more user-friendly workflows than s-aggregation. One example is a Growing Degree Day workflow that processes sensor data streams containing hourly temperatures ordered by timestamp. Its aim is to compute daily growing degree day units by counting tuples in every hour in the incoming stream. This method is not very reliable, because the rate at which data streams arrive is nondeterministic and therefore one cannot, in general, predict the number of tokens present in every window. Another current limitation of Kepler concerns aggregates over sliding windows: the user has to rerun the workflow for every window. Thus Kepler is currently not very flexible, reliable, or user-friendly in performing t-aggregations over sliding windows.

In the context of scientific workflows, we have developed an actor in Kepler that can be used in several workflows to perform t-aggregations. Its inputs are a data stream, which is a sequence of tuples each containing a data value and a timestamp, and a window stream, which is a sequence of tuples each containing a start timestamp and an end timestamp. The actor includes the following aggregates: count, sum, average, maximum, minimum, and array (a form of grouping). These are all standard aggregates, except perhaps the array function, which groups all data values that fall within a t-window into a single array (conceptually, a list). We describe the algorithm and give an example demonstrating how the actor works. We also present several scientific case studies demonstrating the utility of the actor in several applications.
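
To make the idea of t-aggregation concrete, the following is a minimal Python sketch, not the thesis's Kepler actor or its algorithm (Kepler actors are written in Java, and the abstract does not give the algorithm). The names `Reading`, `Window`, and `t_aggregate` are hypothetical, and the choice of half-open windows (start inclusive, end exclusive) is an assumption; the abstract does not say how window boundaries are treated.

```python
from typing import Iterable, List, NamedTuple, Optional


class Reading(NamedTuple):
    """One tuple of the data stream: a timestamp and a data value (hypothetical name)."""
    timestamp: float
    value: float


class Window(NamedTuple):
    """One tuple of the window stream: start and end timestamps (hypothetical name)."""
    start: float
    end: float


def t_aggregate(data: Iterable[Reading], windows: Iterable[Window]) -> List[dict]:
    """Compute count, sum, average, maximum, minimum, and array per t-window.

    Assumption: a reading belongs to a window when start <= timestamp < end.
    Windows may overlap, so sliding windows need no re-run of the computation.
    """
    readings = list(data)
    results = []
    for w in windows:
        # Membership is decided by timestamp, not by position in the stream.
        values = [r.value for r in readings if w.start <= r.timestamp < w.end]
        count = len(values)
        average: Optional[float] = sum(values) / count if count else None
        results.append({
            "window": w,
            "count": count,
            "sum": sum(values),
            "average": average,
            "maximum": max(values) if values else None,
            "minimum": min(values) if values else None,
            "array": values,
        })
    return results


# Hourly temperatures with the reading for hour 3 missing from the stream.
stream = [Reading(0, 10.0), Reading(1, 12.0), Reading(2, 11.0),
          Reading(4, 15.0), Reading(5, 14.0)]
# Overlapping (sliding) windows, each three hours long.
sliding = [Window(0, 3), Window(2, 5), Window(3, 6)]
results = t_aggregate(stream, sliding)
for row in results:
    print(row["window"], row["count"], row["average"], row["array"])
```

In this sketch the second and third windows each contain only two readings because hour 3 is missing. A scheme that groups every three consecutive tuples would silently pull a reading from a neighbouring period into the group, which is the reliability problem the abstract attributes to sequence-based aggregation.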
Keywords/Search Tags:Scientific, Data, Workflow, Window, Aggregates, Sliding, Over, Actor