Research On Query Estimation Techniques On Dirty Database Management System

Posted on:2013-09-15

Degree:Master

Type:Thesis

Country:China

Candidate:L Yang

Full Text:PDF

GTID:2268330392467980

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, with the rapid growth of information, dirty data (erroneous, duplicate, uncertain, or inconsistent data) has come to exist in most database systems. It greatly reduces data quality and causes serious losses to the community. New techniques are therefore needed to process dirty data and reduce its harm. Existing work on processing dirty data focuses mainly on data cleaning and data repairing. However, both have limitations: they cannot clean or repair dirty data exhaustively, and they are generally time-consuming. Some researchers have therefore proposed techniques that run queries directly on dirty data and return query results annotated with a clean degree. But these techniques apply only to some special queries.

To manage dirty data better, a uniform model is needed. The most widely used model is the probabilistic data model. It can represent uncertain data but cannot describe the effect of query operations on the quality of the results. More importantly, it generates all possible world instances during query processing, which causes exponential growth in data size and hurts the efficiency of the system. To address the deficiencies of these approaches, this thesis proposes a new model, the entity-based relational database model, which can manage dirty data effectively. We redefine the traditional query operations so that they support queries with data quality requirements.

Given the characteristics of this model, traditional database implementation techniques are not applicable. We therefore focus on the implementation of query estimation. First, we propose a new histogram-based selectivity estimation method. We propose three new histogram structures that overcome the drawbacks of existing histograms on entity-based relational databases and give good estimates. Then, we propose a new similarity join size estimation method for the entity-based relational database. In this method, we cluster similar values using locality-sensitive hashing (LSH) and sample from the cluster sets to estimate the result size. Finally, we validate the two estimation algorithms through experiments.
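
The LSH-plus-sampling idea for similarity join size estimation can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes string values, Jaccard similarity over character 2-grams, a MinHash signature used as a single LSH bucket key, and uniform pair sampling inside each bucket. All function names and parameters (`lsh_key`, `estimate_self_join_size`, `seeds`, `sample_size`) are hypothetical.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations


def shingles(value, k=2):
    """Set of character k-grams of a lowercased string."""
    s = value.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity of the k-gram sets of two strings."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def lsh_key(value, seeds):
    """MinHash signature, used here as a single LSH bucket key (assumption)."""
    grams = shingles(value)
    return tuple(
        min(zlib.crc32(f"{seed}:{g}".encode()) for g in grams)
        for seed in seeds
    )


def estimate_self_join_size(values, threshold, seeds=(1, 2),
                            sample_size=50, rng_seed=0):
    """Estimate how many value pairs have similarity >= threshold.

    Values are grouped into LSH buckets; within each bucket a sample of
    pairs is verified and the hit rate is scaled to the bucket's pair count.
    Similar pairs that land in different buckets are missed, so this sketch
    tends to underestimate.
    """
    rng = random.Random(rng_seed)
    buckets = defaultdict(list)
    for v in values:
        buckets[lsh_key(v, seeds)].append(v)

    estimate = 0.0
    for members in buckets.values():
        n = len(members)
        total_pairs = n * (n - 1) // 2
        if total_pairs == 0:
            continue
        if total_pairs <= sample_size:
            # Small bucket: check every pair instead of sampling.
            pairs = list(combinations(members, 2))
        else:
            # Large bucket: draw random pairs of distinct positions.
            pairs = [tuple(rng.sample(members, 2)) for _ in range(sample_size)]
        hits = sum(jaccard(a, b) >= threshold for a, b in pairs)
        estimate += hits * total_pairs / len(pairs)
    return estimate


def exact_self_join_size(values, threshold):
    """Brute-force count of similar pairs, for comparison on small inputs."""
    return sum(jaccard(a, b) >= threshold for a, b in combinations(values, 2))
```

On a small list of names, `estimate_self_join_size(names, 0.6)` can be compared against `exact_self_join_size(names, 0.6)`. Using several hash tables rather than one would reduce the number of missed pairs, at the cost of having to correct for pairs counted more than once.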

Keywords/Search Tags:

data quality, dirty data, query optimization, query estimation

Related items

1	Research On Key Technology For Query Optimization On Dirty Database
2	Query Processing On XML Data With Dirty Tags
3	Fuzzing Methods For Query Processing Functionality Of Analytical Databases
4	Research On Techniques And Systems For Index And Query Optimization Of Big Data
5	Research On Distributed Query Optimization And Implementation Of Data Governance Platform
6	Keyword Query For RDF Data Based On Query Translation
7	Research On Sampling Based Aggregate Query Method Of Power Quality Data
8	Query Processing And Optimization Over Various Types Of Streaming Data
9	Research On Distributed Query Processing And Optimization Of RDF Data
10	Design And Realization Of Optimized Query Strategy About Multi-Tenant Saas Based Application