The development of bucketing operators and a supporting operator framework for relational database management systems

Posted on: 2008-07-31
Degree: Ph.D.
Type: Dissertation
University: University of Minnesota
Candidate: Bruso, Kelsey Lee
Full Text: PDF
GTID: 1440390005477820
Subject: Computer Science
Abstract/Summary:
Many researchers rely on relational database management systems (RDBMSs) for storage, retrieval, and analysis of their data. Commercial RDBMSs provide rudimentary pattern-matching capabilities that find data items by an exact match of values, a range of values, or a pattern string with wildcards. These capabilities are insufficient for researchers who wish to perform more sophisticated data analysis. Our focus is bucketing analysis, where techniques such as clustering, ranking, Ntile, and cross-tabulation associate data items with buckets to provide insight into similarities and differences among the data items. Despite the variety of techniques, researchers have little guidance on how to choose an analysis technique appropriate for their data sets and research questions. Even after choosing a technique, they still face the vexing problem of determining how many buckets to use.

The contributions of this work are as follows: (1) We designed and implemented a suite of bucketing operators in the form of PL/SQL extensions to the Oracle DBMS. (2) We developed a parametric framework to compare and contrast bucketing operators using twenty-five different aspects. (3) From the framework we created three operator taxonomies to help researchers choose an operator appropriate to their data sets and research questions. (4) We created a heuristic, RSQRT, to estimate the number of buckets in a data set, and we show a high correlation between the number of buckets estimated by RSQRT and the number actually reported by researchers using bucketing analysis. We offer a statistical analysis showing that RSQRT and a leading alternative estimation technique, the Bayesian information criterion (BIC), generate the same estimates, while RSQRT is less computationally intensive: O(log log n) compared with O(n^2) for BIC.
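To make the idea of a bucketing operator concrete, here is a minimal Python sketch of Ntile-style bucketing, one of the techniques named in the abstract. It follows the usual SQL NTILE convention (items ordered by value, bucket sizes differing by at most one, larger buckets first). The function name `ntile` and its parameters are illustrative assumptions; this is not the dissertation's PL/SQL implementation, and it does not model the RSQRT heuristic, which the abstract does not define.

```python
def ntile(values, n_buckets):
    """Assign each value a bucket number in 1..n_buckets (hypothetical helper).

    Assumed semantics, modeled on SQL's NTILE: values are ranked in
    ascending order and split into n_buckets groups whose sizes differ
    by at most one, with any extra items going to the earliest buckets.
    """
    if n_buckets < 1:
        raise ValueError("n_buckets must be at least 1")

    # Indices of the input, ordered by the value they point at.
    order = sorted(range(len(values)), key=lambda i: values[i])

    # Every bucket gets `base` items; the first `extra` buckets get one more.
    base, extra = divmod(len(values), n_buckets)

    buckets = [0] * len(values)
    pos = 0
    for b in range(1, n_buckets + 1):
        size = base + (1 if b <= extra else 0)
        for i in order[pos:pos + size]:
            buckets[i] = b
        pos += size
    return buckets


if __name__ == "__main__":
    # Five measurements split into two buckets: the three smallest
    # values land in bucket 1, the two largest in bucket 2.
    print(ntile([50, 10, 40, 20, 30], 2))
```

The result is returned in the original input order, so each data item can be paired directly with its bucket. Choosing `n_buckets` is left to the caller, which is exactly the open question the abstract raises.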
Keywords/Search Tags:Data, Bucketing operators, RSQRT, Researchers, Framework