Probabilistic Graphical Models Based On Data Cleaning

Posted on:2015-03-09

Degree:Master

Type:Thesis

Country:China

Candidate:L Duan

Full Text:PDF

GTID:2268330431967356

Subject:Computer software and theory

Abstract/Summary:

As more data is used to provide services and support decision making, data quality has attracted considerable attention in recent years. Unfortunately, data quality is often degraded by dirty data, such as missing or incorrect values, which is prevalent in real-world databases. Guaranteeing data quality requires cleaning this dirty data, which motivates research on data cleaning. Existing methods repair missing or incorrect data when domain knowledge is given, such as integrity constraints, functional dependencies, or expert knowledge. However, dirty data must also be cleaned when domain knowledge is not available.

Fortunately, the proportion of dirty data in the databases of real applications is typically not large. This means that the dependencies among the correct data can be used to clean the dirty data. The Bayesian network (BN) is well known as an effective framework for representing dependencies among random variables, and many effective methods have been proposed for constructing a BN from data. Given a BN that represents the dependencies among the attributes of a database, missing data can be cleaned by using the probability distribution over the possible values of the missing data, obtained by inference on the BN. Moreover, some methods have been proposed to represent correlations among the input data, intermediate data, and output data of a database query. For an incorrect query result, it is therefore natural to detect errors in the input data based on a BN. We thus propose a method for cleaning missing data and incorrect data when domain knowledge is not available. The main contributions of this thesis are as follows:

(1) This thesis proposes an efficient dependency analysis method to learn a BN from databases containing missing data, as the basis for cleaning missing values.
(2) This thesis proposes an approximate inference method to predict the probability distributions over the possible values of the missing data. We then replace each missing value with the possible value that has the largest probability and store the distributions in a probabilistic database for probabilistic queries.
(3) This thesis proposes a method to construct a BN that represents the complex correlations among the input data, intermediate data, and output data of a query on probabilistic databases.
(4) This thesis introduces the notion of the Blame of nodes in a BN and an approximate method to compute the Blame for detecting errors in databases.
(5) This thesis implements the proposed algorithms and reports preliminary experiments that test the feasibility of the method.
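The idea behind contribution (2) can be illustrated with a deliberately tiny sketch. It is not the thesis's algorithm: it assumes a hypothetical two-attribute table in which one attribute depends on the other (a two-node BN), estimates the conditional distribution by simple counting over the complete rows rather than by approximate inference, and then fills each missing value with its most probable value while keeping the full distribution. The function names, attribute names, and sample rows are all made up for illustration.

```python
from collections import Counter, defaultdict


def learn_cpt(rows, parent, child):
    """Estimate P(child | parent) by counting rows where both values are present."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[parent] is not None and row[child] is not None:
            counts[row[parent]][row[child]] += 1
    return {
        p: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for p, ctr in counts.items()
    }


def impute(rows, parent, child):
    """Fill missing child values with the most probable value given the parent.

    Returns the cleaned copies of the rows and, per repaired row index, the
    distribution that was used (what a probabilistic database would store).
    """
    cpt = learn_cpt(rows, parent, child)
    cleaned, distributions = [], {}
    for i, row in enumerate(rows):
        row = dict(row)  # copy so the input rows are left untouched
        if row[child] is None and row[parent] in cpt:
            dist = cpt[row[parent]]
            distributions[i] = dist
            row[child] = max(dist, key=dist.get)
        cleaned.append(row)
    return cleaned, distributions


# Hypothetical sample data: "grade" is assumed to depend on "dept".
rows = [
    {"dept": "sales", "grade": "B"},
    {"dept": "sales", "grade": "B"},
    {"dept": "sales", "grade": "C"},
    {"dept": "ops", "grade": "A"},
    {"dept": "sales", "grade": None},  # missing value to repair
]
cleaned, distributions = impute(rows, "dept", "grade")
print(cleaned[4], distributions[4])
```

With more attributes, the counting step would be replaced by structure learning and inference over a full BN; the fill-with-most-probable-value step and the stored distribution stay the same in spirit.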

Keywords/Search Tags:

Data cleaning, Probabilistic database, Probabilistic graphical model, Bayesian network, Probabilistic inference

Related items

1	An Improved Probabilistic Database Model And Its Probabilistic Nearest Neighbors Query Research
2	Efficient inference algorithms for some probabilistic graphical models
3	Probabilistic graphical models and variational Bayesian inference in receiver design for MIMO-OFDM systems
4	Probabilistic Graphical Models For Data-intensive Computing Construction Method And Implementation
5	Research On Structures Learning Of Several Probabilistic Graphical Models
6	Image Segmentation With Probabilistic Graphical Model
7	Research On Bayesian Ranking Algorithms Based On Probabilistic Graphical Model
8	A Study Of Social Network Information Filtering Based On Probabilistic Graphic Model
9	Probabilistic Graphical Model And Its Application In Video Segmentation
10	Research On Processing Null Values In Probabilistic Database And Probabilistic Interval Model