Efficient and parallel evaluation of XQuery

Posted on: 2007-05-31
Degree: Ph.D
Type: Dissertation
University: The Ohio State University
Candidate: Li, Xiaogang
Full Text: PDF
GTID: 1448390005475850
Subject: Computer Science
Abstract/Summary:
With the increased popularity of XML, querying and processing XML data has become a very important topic. Most recent work in this area has been in the context of XQuery, the XML query language developed by the World Wide Web Consortium (W3C). This dissertation presents our approach for efficient compilation of XQuery queries to facilitate the development of data-intensive applications. As XML and XQuery are being used for larger datasets, parallel execution and stream processing are two ways to reduce storage requirements and/or execution time. Accordingly, our efforts focus on optimizing XQuery and generating efficient code for cluster and streaming environments. In particular, the issues that we investigate include: (1) efficient optimization of XQuery by designing new analysis and transformation techniques, as well as integrating existing compiler optimization and query optimization techniques; (2) designing new techniques for efficient parallelization of XQuery; (3) providing a high-level abstraction of a dataset to an application developer through XML Schemas; and (4) code generation for XQuery toward the desired targets, such as clusters and streaming environments.

In the area of high-level optimizations, we have developed a new set of optimization and transformation algorithms for XQuery, based on a new internal representation referred to as Generalized Nested Loop (GNL). These optimization techniques include aggregation rewrite, loop fusion, loop interchange, and aggregation remapping. Since XQuery is a very powerful and complex functional language, to enable the above optimization techniques, we have developed new algorithms to handle arbitrary recursive functions and type systems in XQuery. As XML and XQuery are being used for larger datasets, parallelizing XQuery execution can enable faster response. In the area of parallelization, GNL offers a convenient basis for parallelization of XQuery. We present techniques for enumerating parallelization strategies, cost models for choosing the optimal one, and an algorithm for parallel code generation toward a middleware called ADR. Furthermore, we investigated techniques to parallelize XQuery for native XML datasets on clusters.

To further simplify application development over scientific datasets, we use XML Schemas as a high-level abstraction of a dataset for the application developer. A corresponding low-level Schema describes the actual layout of the data and is used by the compiler for code generation. We also provide a systematic way of translating the high-level code to low-level code that achieves high locality and efficient execution.

For stream processing of XQuery, we have designed the concept of a Data Flow Graph and applied a series of high-level transformations based on it. The goal of these transformations is to enable a single-pass evaluation strategy for the original query. Based on a SAX parsing engine, we have proposed a new technique for generating efficient code that minimizes memory usage.

We have implemented and evaluated the above techniques. Results from several XMark queries and scientific data processing queries show large improvements from the new optimizations and good speedups.
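
As a rough illustration of the single-pass, SAX-based evaluation idea mentioned in the abstract, the sketch below computes an aggregate comparable to the XQuery expression `sum(//item/price)` without building a document tree. It is not the code the dissertation's compiler generates. The element names `item` and `price`, the class `SumHandler`, and the function `sum_prices` are hypothetical and chosen only for this example. The handler keeps a stack of open element names, so its memory use grows with nesting depth rather than with document size.

```python
import xml.sax


class SumHandler(xml.sax.ContentHandler):
    """Hypothetical handler: accumulates the sum of item/price in one pass."""

    def __init__(self):
        super().__init__()
        self.path = []      # stack of currently open element names
        self.buffer = []    # text chunks of the <price> element being read
        self.total = 0.0    # running aggregate; no tree is ever built

    def _in_target(self):
        # True only while inside a <price> that is a direct child of <item>.
        return self.path[-2:] == ["item", "price"]

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._in_target():
            self.buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so collect them.
        if self._in_target():
            self.buffer.append(content)

    def endElement(self, name):
        if self._in_target():
            self.total += float("".join(self.buffer))
        self.path.pop()


def sum_prices(xml_bytes):
    """Stream over the document once and return the aggregate."""
    handler = SumHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.total
```

For example, calling `sum_prices` on a document with two `item` elements priced 1.5 and 2.5 adds each price as its closing tag is seen and discards the text immediately afterward.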
Keywords/Search Tags: XQuery, XML, Efficient, Data, Techniques, Processing, New, Optimization