
Probabilistic Graphical Models Based On Data Cleaning

Posted on: 2015-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: L Duan
Full Text: PDF
GTID: 2268330431967356
Subject: Computer software and theory
Abstract/Summary:
As more data is used to provide services and support decision making, data quality has attracted much attention in recent years. Unfortunately, data quality is often degraded by dirty data, such as missing or incorrect values, which is prevalent in real-world databases. Guaranteeing data quality requires cleaning this dirty data, which motivates research on data cleaning. Existing methods repair missing or incorrect data when domain knowledge is given, such as integrity constraints, functional dependencies, or expert knowledge. However, dirty data must also be cleaned when domain knowledge is not available.

Fortunately, the proportion of dirty data in the databases of real applications is typically small. This means that the dependencies among the correct data can be used to clean the dirty data. A Bayesian network (BN) is a well-known, effective framework for representing dependencies among random variables, and many effective methods have been proposed for constructing a BN from data. Given a BN that represents the dependencies among the attributes of a database, missing data can be cleaned using the probability distribution over the possible values of each missing item, obtained by inference on the BN. Moreover, methods have been proposed to represent the correlations among the input data, intermediate data, and output data of a database query. When a query returns an incorrect result, it is natural to use a BN to detect errors in the input data. We therefore propose a method for cleaning missing data and incorrect data when domain knowledge is not available. The main contributions of this thesis are as follows:

(1) This thesis proposes an efficient dependency analysis method to learn a BN from databases containing missing data, as the basis for cleaning missing values.
(2) This thesis proposes an approximate inference method to predict the probability distributions of the possible values of the missing data. Each missing item is then filled in with its most probable value, and the distributions are stored in a probabilistic database for probabilistic queries.
(3) This thesis proposes a method to construct a BN that represents the complex correlations among the input data, intermediate data, and output data of a query on probabilistic databases.
(4) This thesis introduces the notion of the Blame of nodes in a BN, together with an approximate method for computing Blame in order to detect errors in databases.
(5) This thesis implements the proposed algorithms and reports preliminary experiments that test the feasibility of the method.
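The idea behind contribution (2), filling a missing value with its most probable value and keeping the full distribution, can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes a hypothetical two-attribute network city -> province, estimates the conditional probabilities by counting complete rows, and uses exact lookup instead of approximate inference. The names `learn_cpt` and `impute`, and the sample rows, are made up for illustration.

```python
from collections import Counter, defaultdict


def learn_cpt(rows, parent, child):
    """Estimate P(child | parent) by counting rows where both values are present."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[parent] is not None and row[child] is not None:
            counts[row[parent]][row[child]] += 1
    return {
        p: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for p, ctr in counts.items()
    }


def impute(rows, parent, child):
    """Fill missing child values with the most probable value given the parent.

    Returns the cleaned rows and, per repaired row index, the distribution
    that a probabilistic database could store alongside the repaired value.
    """
    cpt = learn_cpt(rows, parent, child)
    cleaned, distributions = [], {}
    for i, row in enumerate(rows):
        row = dict(row)  # copy so the input rows are left untouched
        if row[child] is None and row[parent] in cpt:
            dist = cpt[row[parent]]
            distributions[i] = dist
            row[child] = max(dist, key=dist.get)
        cleaned.append(row)
    return cleaned, distributions


# Hypothetical data: one misspelled value and one missing value.
rows = [
    {"city": "Kunming", "province": "Yunnan"},
    {"city": "Kunming", "province": "Yunnan"},
    {"city": "Kunming", "province": "Yunan"},  # dirty value, in the minority
    {"city": "Dali", "province": "Yunnan"},
    {"city": "Kunming", "province": None},     # missing value to repair
]
cleaned, distributions = impute(rows, "city", "province")
print(cleaned[4], distributions[4])
```

Because dirty values are in the minority, the dependency learned from the mostly correct rows still points to the right repair, which is the assumption the abstract relies on.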
Keywords/Search Tags: Data cleaning, Probabilistic database, Probabilistic graphical model, Bayesian network, Probabilistic inference