Multiple query optimization support for data analysis applications

Posted on: 2004-12-15
Degree: Ph.D
Type: Dissertation
University: University of Maryland College Park
Candidate: Andrade, Henrique C. M
Full Text: PDF
GTID: 1458390011457805
Subject: Computer Science
Abstract/Summary:
The efficient storage, management, and manipulation of large datasets are important in many fields of science, engineering, and business. Simulations and experimental measurements are the main sources of data in these fields, and the amount of data available for analysis is increasing at a very high pace, due both to the increased capability to collect and store data and to the increased capability to process it. We broadly refer to applications that operate on such data as data analysis applications. Their main characteristic is that they usually access only a subset of all the data available (the hot spots), which are the data points of highest interest in generating data products.
In many cases, data analysis is employed in a collaborative environment, where multiple clients access the same datasets and perform similar processing on the data. For instance, in medical training, a large group of students may want to simultaneously explore a similar set of digitized microscopy slides or visualize the same high-resolution Magnetic Resonance Imaging (MRI) results. In this case, the data server needs to process multiple queries simultaneously to minimize latency to the clients.
Previously investigated multi-query optimization (MQO) techniques do not account for the user-defined processing of data and the user-defined aggregation methods that are typical of data analysis queries. Therefore, the problem we investigate in this dissertation is multiple query optimization for data analysis applications. It can be broadly defined as a set of techniques aimed at minimizing the total cost of processing a series of queries by creating an optimized access plan for the entire set of queries and by reusing previously computed aggregates.
The main goal of our work is to provide a generic optimization framework that can be used as a common platform to deploy data analysis applications that are able to efficiently handle multiple simultaneous queries and can leverage previously computed results to partially or fully compute new queries.
In this work, we show significant improvements in several aspects of data management. These include the integration of an active semantic cache approach coupled with a data transformation model for reusing data and computation, a functional decomposition framework for exposing reuse sites, query scheduling policies, and cache replacement policies. Finally, we show how all these techniques can be adequately implemented over new computation and execution model paradigms such as clusters of PCs and highly distributed, heterogeneous data grid environments.
Keywords/Search Tags: Data, Multiple, Optimization, Query