Research On Improved Bayesian Methods For Replacing Missing Data

Posted on: 2012-04-10  Degree: Master  Type: Thesis
Country: China  Candidate: X Shen  Full Text: PDF
GTID: 2178330338997412  Subject: Computer software and theory
Abstract/Summary:
The volume of data collected in many sectors is growing explosively, driven by the rapid development of information technology and data acquisition methods. Data mining can extract potentially valuable information from raw data for further analysis and use. However, real-world data usually suffer from various quality problems, whereas most research and models assume ideal data. Missing data is one of the most pressing of these quality problems.

The main current solutions to the missing data problem are deletion, constant replacement, statistical replacement, simple value replacement, and complex value replacement. Compared with the other methods, complex value replacement usually gives a better completion result because the predicted value is calculated from the known data. This thesis focuses on the simple Bayesian approach, which is widely used for estimating and replacing missing categorical data. However, the simple Bayesian approach ignores the relationships between the attributes of the data, which leads to inaccurate predicted values.

This thesis proposes the dual-scale Bayesian (DSB) approach, which modifies the simple Bayesian approach to overcome this shortcoming in estimating and replacing data. The approach takes the relationships between attributes into account: the predicted probability that a missing value belongs to a given category is obtained from two posterior probabilities, each calculated by the simple Bayesian method, combined through a correction factor. The MaxPost and PropPost methods are then used to replace the missing data. MaxPost replaces the missing value with the value that has the maximum posterior probability, while PropPost uses a value selected with probability proportional to the estimated posterior distribution.

The experiments are based on four UCI datasets from different domains. Using the proposed DSB method, a predicted value is calculated and assigned to each missing entry so that it is completed more reasonably. Results on three different evaluation indicators show that the proposed algorithm is more effective and accurate than the simple Bayesian method.
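The baseline that the abstract describes, a simple (naive) Bayesian posterior followed by MaxPost or PropPost replacement, can be illustrated with a minimal Python sketch. This is not the thesis's implementation and does not include the DSB correction factor, whose formula the abstract does not give. The function names, the toy records, and the Laplace smoothing parameter `alpha` are all assumptions made for illustration.

```python
import random
from collections import Counter


def naive_bayes_posterior(rows, target, observed, alpha=1.0):
    """Posterior over the values of `target` given the `observed` attributes.

    rows     -- complete records (list of dicts) used for estimation
    target   -- name of the attribute whose value is missing
    observed -- dict of known attribute values in the incomplete record
    alpha    -- Laplace smoothing constant (an assumption, not from the thesis)
    """
    class_counts = Counter(r[target] for r in rows)
    scores = {}
    for c in sorted(class_counts):
        # Prior P(c), then multiply by each P(attribute = value | c),
        # treating attributes as conditionally independent given c.
        score = class_counts[c] / len(rows)
        for attr, val in observed.items():
            n_values = len({r[attr] for r in rows})
            match = sum(1 for r in rows if r[target] == c and r[attr] == val)
            score *= (match + alpha) / (class_counts[c] + alpha * n_values)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def max_post(posterior):
    """MaxPost: pick the value with the largest posterior probability."""
    return max(posterior, key=posterior.get)


def prop_post(posterior, rng):
    """PropPost: draw a value with probability proportional to its posterior."""
    values = list(posterior)
    weights = [posterior[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


# Toy categorical records (hypothetical data, not one of the UCI datasets).
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
]

# A record whose "play" value is missing.
incomplete = {"outlook": "sunny", "windy": "no"}
posterior = naive_bayes_posterior(rows, "play", incomplete)

filled_max = max_post(posterior)
filled_prop = prop_post(posterior, random.Random(0))
```

MaxPost is deterministic and always returns the most probable category, whereas PropPost is randomized and so tends to preserve the estimated distribution of the attribute across many replacements. The independence assumption inside `naive_bayes_posterior` is the limitation that the DSB approach is said to address.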
Keywords/Search Tags: categorical data, missing data, DSB, correction factor