
Research On Query Estimation Techniques On Dirty Database Management System

Posted on: 2013-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: L Yang
Full Text: PDF
GTID: 2268330392467980
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rapid growth of information, erroneous, duplicate, uncertain, or inconsistent dirty data has come to exist in most database systems, which greatly reduces data quality and causes serious losses to the community. New techniques are therefore needed to process dirty data and reduce its harm. Existing work on processing dirty data focuses mainly on data cleaning and data repairing. However, both have limitations: they cannot clean or repair dirty data exhaustively, and they are generally time-consuming. Some researchers have therefore proposed techniques that run queries on dirty data directly and return query results annotated with a clean degree, but these techniques apply only to certain special queries. To manage dirty data better, we need a uniform model. The most widely used model is the probabilistic data model. It can represent uncertain data, but it cannot describe the effect of query operations on the quality of the results. More importantly, it generates all possible world instances during query processing, which makes the data size grow exponentially and hurts system efficiency. To address the deficiencies of these approaches, in this paper we propose a new model, the entity-based relational database model, which can manage dirty data effectively. We redefine the traditional query operations so that they support queries with data quality requirements. Given the characteristics of this model, traditional database implementation techniques are not applicable, so this paper focuses on the implementation of query estimation. First, we propose a new histogram-based selectivity estimation method, with three new histogram structures that overcome the drawbacks of existing histograms on entity-based relational databases and give good estimates. Then, we propose a new similarity join size estimation method for the entity-based relational database: we cluster similar values using locality-sensitive hashing (LSH) and sample from the cluster sets to estimate the result size. Finally, we validate our two estimation algorithms experimentally.
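The cluster-then-sample idea behind the similarity join size estimate can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes string values, character 2-grams, Jaccard similarity, and a single MinHash hash table so that clusters form a partition. All function names and parameters (`estimate_self_join_size`, `threshold`, `num_hashes`, `sample_size`) are hypothetical. Because similar pairs that land in different clusters are never counted, the sketch tends to underestimate.

```python
import hashlib
import itertools
import random
from collections import defaultdict


def qgrams(value, q=2):
    """Set of character q-grams of a string (the whole string if it is short)."""
    if len(value) <= q:
        return {value}
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(grams, num_hashes):
    """MinHash signature: for each seed, the minimum hash over all q-grams."""
    return tuple(
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams
        )
        for seed in range(num_hashes)
    )


def estimate_self_join_size(values, threshold=0.8, num_hashes=2,
                            sample_size=100, rng=None):
    """Estimate the number of unordered pairs with Jaccard >= threshold.

    Assumption (not from the thesis): one LSH table keyed by the full
    MinHash signature, so every value belongs to exactly one cluster.
    """
    rng = rng or random.Random(0)
    gram_sets = [qgrams(v) for v in values]

    # Step 1: cluster values whose MinHash signatures collide.
    clusters = defaultdict(list)
    for idx, grams in enumerate(gram_sets):
        clusters[minhash_signature(grams, num_hashes)].append(idx)

    # Step 2: within each cluster, sample pairs and scale up the hit rate.
    estimate = 0.0
    for members in clusters.values():
        n = len(members)
        total_pairs = n * (n - 1) // 2
        if total_pairs == 0:
            continue
        if total_pairs <= sample_size:
            # Small cluster: check every pair exactly.
            pairs = list(itertools.combinations(members, 2))
        else:
            # Large cluster: draw random pairs of distinct members.
            pairs = [tuple(rng.sample(members, 2)) for _ in range(sample_size)]
        hits = sum(
            jaccard(gram_sets[i], gram_sets[j]) >= threshold for i, j in pairs
        )
        estimate += total_pairs * hits / len(pairs)
    return estimate
```

Calling `estimate_self_join_size(["alice", "alice", "alice", "bob", "bob", "zed"])` checks every within-cluster pair exactly, because each cluster is smaller than `sample_size`. Raising `num_hashes` makes clusters tighter, so fewer dissimilar pairs are sampled but more truly similar pairs are missed. A multi-table LSH scheme would reduce the missed pairs, but it needs extra care to avoid counting a pair twice.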
Keywords/Search Tags:data quality, dirty data, query optimization, query estimation