Font Size: a A A

Research On XML Indexing And Query Processing

Posted on:2010-08-11Degree:DoctorType:Dissertation
Country:ChinaCandidate:H Q WangFull Text:PDF
GTID:1118360278495869Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Extensable markup language (XML) is a specification created by W3C organization to help developer to organize information freely. It aims to not only fulfill the need of increasing network application, but also provide well reliability and interaction when coorperating with network. XML has four features, namely well formated, extensable, highly structured and convient for network transmission. These features help XML gains extrodenary performance.Recently, there are many techniques on XML data storage and query. But these techniques can not satisfy the requirement of high efficiency on data processing, and usually these techniques do not use index or process queries based on database systems. So the performance of processing large or complex structured XML data is bad.In this paper, we aim to process XPath query on XML efficiently. We propose several labeling scheme suitable for XML tree structure, and organize them as index. Based on the labeling scheme, we propose high efficient algorithms to support XPath query, the main contribution of this paper are:(1) We propose ClusteredAbsolutePath (CAP for brief) labeling scheme and corresponding algorithms. A CAP is defined based on the position of a child node to its parent node in XML definition. So the CAP labeling scheme supports position predict in XPath natually. A CAP can group nodes with similar position and then boost query processing. Based on CAP labeling scheme, we propose BranchFilter and RelatedPathJoin algorithms to process query efficiently. Experiments show that these two algorithms outperform TwigStack algorithm and have good scalability.(2) We propose ClusteredChainPath (CCP for brief) labeling scheme and corresponding algorithms. A CCP groups nodes with same source path and their ancestors as a tree. Nodes on the tree with same depth are bi-similar. All leaf nodes of these trees is a partition of XML nodes. Then the index structure of CCP is the same of 1-Index. Because the ancestors are saved, we can quickly get the ancestor set of a given node set to boost query processing. Based on CCP labeling, we propose RelatedPathJoin algorithm to process query efficiently. Experiments show that algorithm RelatedPathJoin outperforms TwigStack algorithm and has good scalability.(3) We propose SourcePathTree labeling (SPT for brief) by using node region to represent continious node sequence. We optimize the data storage to make the I/O cost minimum. We design algorithms to group related path before join, then only SPT with same related path can be joined, otherwise, they are abandoned without being loaded into memory. Experiments shows that the refined labeling scheme takes smaller storage, smaller I/O cost and better performance.(4) We propose PrimeSequence labeling scheme and corresponding algorithm. PrimeSequence labeling scheme can be used to create an F&B-Index and support query processing efficiently. In PrimeSequence labeling scheme, a node is labeled as a string of primes or production of several distinct primes, if two nodes have the same labeling, they are forward and backward bi-similar. We also propose the algorithm to create region encoding and CCP encoding on F&B-Index, then we can run TwigStack and RelatedPathJoin on F&B-Index. An algorithm based PrimeSequence labeling scheme called DivisionMatch is proposed for query processing, the algorithm can fast confirm subtree structure of a given node. We compare the three algorithms on several dataset and conclude that DivisionMatch algorithm based PrimeSequence labeling scheme performs better than TwigStack and RelatedPathJoin on F&B-Index.
Keywords/Search Tags:XML, Indexing, Query Processing, Clustered Absolute Path, Clustered Chain Path, Prime Sequence
PDF Full Text Request
Related items