
A Result Size Estimation Algorithm For Value Predication In XML Query

Posted on: 2009-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z Wang
Full Text: PDF
GTID: 2178360278457600
Subject: Computer application technology
Abstract/Summary:
In recent years, XML (Extensible Markup Language) has become a new standard for data representation and data exchange on the Internet. Although XML research has produced solid results, XML query technology still faces many difficulties in both theory and performance because of the inherent characteristics of XML. Building on a study of XML query technology, this dissertation summarizes current research, development, and applications, and analyzes XML query optimization in detail from the following aspects: the XML data model, the storage of XML data in databases, the parsing of XML data, and query processing methods. A variety of XML query methods have been put forward, but they give inadequate consideration to complex XML data distributions, which leads to poor query performance.

This dissertation elaborates on query estimation techniques from both one-dimensional and multi-dimensional perspectives and, taking the characteristics of XML into account, proposes using multi-dimensional histograms to summarize XML data in order to simplify processing. The value distribution of an XML element depends not only on the distribution of other values but also on the structural information of the XML document; when that structure is complex, the result is a multi-dimensional set of dependent elements. In that case, storage cost and error rate rise considerably. Therefore, this dissertation applies the discrete cosine transform (DCT) to XML data and, based on the high correlation in XML data, extends the DCT to a high-dimensional model, which yields a high-dimensional DCT formula. The resulting algorithm proves effective in reducing estimation error as well as processing time and memory.

A proposed method needs careful and comprehensive experimental validation. In the experiments, all data are generated in the normalized data space (0,1)^n.
In addition, synthetic data sets of 50K records each are generated, with dimensionality ranging from 2 to 10. The data follow three distributions: (1) normal, (2) Zipf, and (3) clustered. They are used to evaluate (1) storage requirements and selectivity estimation time, (2) the effect of dimensionality and query size, and (3) the effect of data distribution. Extensive experiments show that the proposed method is superior to previous ones, with the following advantages:
1) Previous methods could not support multi-dimensional selectivity estimation, particularly for more than three dimensions, whereas our method supports high-dimensional selectivity estimation with high accuracy.
2) Our method saves time and space.
3) Our method eliminates the periodic reconstruction of statistics for selectivity estimation, because dynamic data updates can be reflected in the statistics immediately.
4) Our method calculates selectivity simply, using the integral of cosine functions. The estimate is also accurate because the method naturally supports interpolation between adjacent buckets.
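
The abstract does not give the dissertation's formulas, so the following is only a minimal two-dimensional sketch of the general idea of DCT-compressed histograms, not the author's algorithm. It assumes an equi-width histogram over (0,1)^2, an orthonormal DCT-II, and a square block of low-frequency coefficients as the retained statistics; all function names (`dct_matrix`, `build_coeffs`, `estimate_count`, `insert_point`) are hypothetical. It illustrates two of the points above: a range count is estimated by integrating the cosine basis functions over the query box, and a new record can be folded into the coefficients without rebuilding them because the DCT is linear.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C: coeffs = C @ f and f = C.T @ coeffs."""
    u = np.arange(n)[:, None]                       # frequency index
    x = (2 * np.arange(n)[None, :] + 1) / (2 * n)   # bucket centres in (0, 1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * u * x)
    c[0, :] /= np.sqrt(2.0)
    return c

def build_coeffs(points, n_buckets, keep):
    """Histogram 2-D points in (0,1)^2 and keep a keep x keep low-frequency block."""
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                bins=n_buckets, range=[[0, 1], [0, 1]])
    c = dct_matrix(n_buckets)
    return (c @ hist @ c.T)[:keep, :keep].copy()

def cosine_integrals(lo, hi, n_buckets, keep):
    """Integral of each 1-D basis function over [lo, hi], scaled to bucket counts."""
    u = np.arange(1, keep)
    w = np.empty(keep)
    w[0] = (hi - lo) / np.sqrt(2.0)
    w[1:] = (np.sin(np.pi * u * hi) - np.sin(np.pi * u * lo)) / (np.pi * u)
    return n_buckets * np.sqrt(2.0 / n_buckets) * w

def estimate_count(coeffs, n_buckets, lo, hi):
    """Estimated number of records in the box [lo[0], hi[0]] x [lo[1], hi[1]]."""
    keep = coeffs.shape[0]
    wx = cosine_integrals(lo[0], hi[0], n_buckets, keep)
    wy = cosine_integrals(lo[1], hi[1], n_buckets, keep)
    return float(wx @ coeffs @ wy)

def insert_point(coeffs, n_buckets, point):
    """Reflect one new record in the coefficients in place (DCT is linear)."""
    keep = coeffs.shape[0]
    c = dct_matrix(n_buckets)[:keep]
    ix = min(int(point[0] * n_buckets), n_buckets - 1)
    iy = min(int(point[1] * n_buckets), n_buckets - 1)
    coeffs += np.outer(c[:, ix], c[:, iy])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Normally distributed records, clipped into the normalized space.
    pts = np.clip(rng.normal(0.5, 0.15, size=(50_000, 2)), 1e-9, 1 - 1e-9)
    coeffs = build_coeffs(pts, n_buckets=32, keep=16)
    lo, hi = (0.25, 0.25), (0.75, 0.75)
    true = int(np.sum(np.all((pts >= lo) & (pts < hi), axis=1)))
    print("estimated:", estimate_count(coeffs, 32, lo, hi), "true:", true)
```

Here 256 coefficients stand in for 1,024 buckets. Because the estimate integrates continuous cosine functions rather than summing whole buckets, query bounds that fall inside a bucket are interpolated automatically. Extending the sketch beyond two dimensions would apply the same transform and the same one-dimensional integrals along each additional axis.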
Keywords/Search Tags: XML, Result Size Estimation, Discrete Cosine Transform