Font Size: a A A

XML query optimization in Timber

Posted on:2005-11-10Degree:Ph.DType:Thesis
University:University of MichiganCandidate:Wu, YuqingFull Text:PDF
GTID:2458390008486904Subject:Computer Science
Abstract/Summary:
With the growing importance of XML as a format for data representation and data transport, being able to process XML queries efficiently becomes a topic of interest. Relational database techniques, even though mature, don't always apply in an XML context, due to the special features of XML data and XML queries. In this thesis, we study these features and investigate various aspects of query processing and query optimization in the framework of the Timber native XML database system.; We evaluate XML queries by breaking them into pattern trees, which specify the nodes to match, and the desired relationships between the nodes. The structural relationships are evaluated via structural joins. We introduce two families of structural join algorithms (Tree-Marge and Stack-Based), which take advantage of the (DocId, StartPos : EndPos, LevelNum) representation of positions of XML components and perform structural join efficiently, with time, space and I/O complexity linear to the input and output lists.; In this thesis, we focus on cost-based optimization for pattern tree matching. We propose five structural join order selection algorithms and show that fully-pipelined plans are possible for any given query pattern. Moreover, left-deep plans, which are chosen as the rule-of-thumb in a relational database, are not desirable. Relaxing the fully-indexed assumption brings issues related to data instantiation into the picture. The challenge is that in XML databases, the "data associated with an element" can vary. We define data instantiation as a first-class physical operator and take into consideration both the placement and granularity of data instantiation while optimizing pattern tree matching.; Accurate cost-based optimization depends on accurate result size estimation. We propose a novel summary structure, position histogram, to capture the position relationship between nodes in an XML document. Experiments demonstrate that the Ph-Hist algorithm is capable of estimating the result size accurately, for both single twig patterns and patterns that are arbitrarily complex.; Collectively, we are trying to address issues of storing and querying XML data efficiently, in the context of the Timber project. However, the theories behind the techniques proposed in this thesis apply irrespective to how the XML data is stored.
Keywords/Search Tags:XML data, Query optimization, XML queries, Timber, Pattern tree matching
Related items