Mining and using coverage and overlap statistics for data integration

Posted on:2005-10-18

Degree:Ph.D

Type:Dissertation

University:Arizona State University

Candidate:Nie, Zaiqing

Full Text:PDF

GTID:1458390008982355

Subject:Computer Science

Abstract/Summary:

Query processing for integrating autonomous data sources on the Internet has received significant attention of late. In traditional query processing, each relation is stored in a single primary database and users expect complete answers. Data integration scenarios differ on both counts: relations are spread across multiple, potentially overlapping sources, and users have conflicting objectives regarding how much answer coverage they want and how much execution cost they are willing to bear to achieve it. Generating optimal query plans in this setting therefore requires coverage and overlap statistics about the autonomous sources. This dissertation first presents StatMiner, an effective statistics mining approach that automatically generates attribute value hierarchies, discovers frequently accessed query classes, and learns coverage and overlap statistics only with respect to these classes. The dissertation then introduces Multi-R, a multi-objective query optimizer that uses coverage and overlap statistics to jointly optimize the coverage and cost of query plans. The efficiency of StatMiner and the effectiveness of the learned statistics are demonstrated in the context of BibFinder, a publicly available bibliography mediator developed as a testbed for this work. The empirical evaluation of Multi-R also shows that its query plans are significantly better than those produced by existing approaches, in terms of both planning cost and plan execution cost.
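
To make the notions of coverage and overlap concrete, the following is a minimal illustrative sketch and not the StatMiner or Multi-R algorithms themselves. It assumes that the answer sets of a few hypothetical sources (named A, B, and C here) are known for one query class, computes coverage and pairwise overlap as fractions of all known answers, and then picks sources greedily by residual coverage, that is, by how many not-yet-covered answers each source would add. The function names, source names, and data are assumptions made for illustration; a real mediator would work from learned statistics rather than raw answer sets.

```python
from typing import Dict, List, Set


def coverage(source: Set[str], universe: Set[str]) -> float:
    """Fraction of all known answers that this source returns."""
    return len(source & universe) / len(universe)


def overlap(a: Set[str], b: Set[str], universe: Set[str]) -> float:
    """Fraction of all known answers that both sources return."""
    return len(a & b & universe) / len(universe)


def greedy_plan(sources: Dict[str, Set[str]], k: int) -> List[str]:
    """Pick up to k sources, always taking the largest residual coverage next."""
    covered: Set[str] = set()
    plan: List[str] = []
    for _ in range(k):
        candidates = [name for name in sources if name not in plan]
        if not candidates:
            break
        best = max(candidates, key=lambda name: len(sources[name] - covered))
        if not sources[best] - covered:
            break  # no remaining source adds a new answer
        plan.append(best)
        covered |= sources[best]
    return plan


# Hypothetical answer sets for one query class (illustrative data only).
sources = {
    "A": {"p1", "p2", "p3", "p4"},
    "B": {"p3", "p4", "p5"},
    "C": {"p1", "p2"},
}
universe = set().union(*sources.values())

print(coverage(sources["A"], universe))
print(overlap(sources["A"], sources["B"], universe))
print(greedy_plan(sources, k=3))
```

In this toy data, source C is never selected because everything it returns is already provided by A. That is the kind of redundant source call that overlap statistics are meant to help a mediator avoid.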

Keywords/Search Tags:

Coverage and overlap statistics, Data, Query, Cost

Related items

1	Research On Set T Coverage Query Algorithm Based On Inverted Index
2	The Research On PostgreSQL Statistics Estimation Based On Block-level Sampling
3	The Research On Postgresql Statistics Estimation Based On Block-level Sampling
4	Research On The Query Optimization Technology Under Automatic Summary Table Which Is Based On Statistics Process Graph
5	The Design And Implementation Of Students Comprehensive Query And Statistics System
6	Self-Service Data Extraction System For Big Data Platform
7	Research On Reachability Query Coverage Over Large Graphs
8	Key Technologies Research On Network Security Monitoring Data Streams
9	Design And Implementation Of The Application System For Guangzhou Panyu District Bureau Of Statistics
10	Cognitive Radio Spectrum Sensing Based On Overlap FFT And Blind Channel Estimation Technology