
Research On Missing Value Imputation Of Incomplete Data

Posted on: 2014-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: L Jin
Full Text: PDF
GTID: 2268330422450602
Subject: Computer Science and Technology
Abstract/Summary:
Missing values are an inevitable problem that cannot be overlooked in research or in industry. As data collection has shifted from manual entry to machines, the growth in data volume has made data quality even harder to guarantee, and missing data is a large part of the problem. Missing values arise from many factors, such as incorrect measurements, limitations of the data collection process, human error during data entry, or failed insertions due to rule violations, which makes them difficult to prevent. According to a report from the Honeywell Corporation in the US, the missing rate in their equipment maintenance and testing data was up to 50%. This is unsurprising compared with other application fields such as medicine, where unstandardized collection methods push the missing rate even higher, above 60%.

Missing values are not merely a loss of information; they also affect downstream data mining and statistical analysis. Common methods for dealing with missing values include deleting incomplete records, treating the missing value as a special value, or filling in these "blanks". The third method is clearly preferable from the standpoint of both data quantity and data quality. Many algorithms have been proposed for missing value imputation. Although these algorithms perform well in different applications, they still have limitations. First, some models, such as decision trees, require the class attribute and conditional attributes to be specified, so the model must be rebuilt when a different attribute is to be imputed. Second, some algorithms have difficulty handling high-dimensional datasets: involving irrelevant attributes not only slows down processing but may also degrade the final result. Third, without ground-truth values for comparison, there is no measure of imputation quality. Lastly, many algorithms can only handle small datasets, which falls far short of the demand for larger ones.
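The abstract does not give code; as a minimal sketch, the three common strategies it names (deletion, special value, imputation) can be illustrated on a toy record list. The field names and values below are hypothetical, and mean imputation stands in for the simplest possible fill rule.

```python
# None marks a missing value ("blank") in these toy records.
records = [
    {"age": 30, "income": 5000},
    {"age": None, "income": 4200},
    {"age": 25, "income": None},
]

# 1. Delete incomplete records (simple, but discards data).
deleted = [r for r in records if all(v is not None for v in r.values())]

# 2. Treat missing as a special value (keeps rows, but distorts statistics).
special = [{k: (v if v is not None else "MISSING") for k, v in r.items()}
           for r in records]

# 3. Impute: fill each blank with the attribute mean over observed values.
def column_mean(key):
    vals = [r[key] for r in records if r[key] is not None]
    return sum(vals) / len(vals)

imputed = [{k: (v if v is not None else column_mean(k)) for k, v in r.items()}
           for r in records]
```

Only the third strategy preserves every record while keeping attribute values on their original scale, which is the sense in which the abstract calls it better for both data quantity and quality.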
To address these problems, we propose an imputation algorithm based on a Bayesian network and probabilistic reasoning. Unlike commonly used Bayesian network construction algorithms, we build the network by discovering correlations between attributes. The construction algorithm runs without a designated target attribute; the most influential attribute becomes the root naturally. This is an almost fully automated process that requires little prior knowledge from the user. The conditional independence assumptions of the Bayesian network decompose the computation of the joint probability, which reduces the complexity of handling high-dimensional datasets. After the Bayesian network has been built, we run a probabilistic reasoning process to complete the imputation task. The probabilities computed during reasoning reflect the imputation accuracy: attributes imputed with higher probabilities achieve higher accuracy. To handle mixed attribute types, our algorithm also takes continuous attributes into consideration. For large datasets, we use parallelization to reduce execution time, implementing our imputation algorithm in the Map-Reduce framework. The experimental results verify the effectiveness of the Bayesian network construction algorithm and the probabilistic reasoning; we also report imputation accuracy in comparison with commonly used methods. For the parallel algorithm, we measured parallel performance and analyzed the factors that affect it.
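The thesis's actual construction and reasoning algorithms are not spelled out in the abstract. As a hedged illustration of the core idea, the sketch below learns a one-parent conditional probability table P(child | parent) from complete records and fills a missing value with the most probable candidate, returning the probability as a confidence score, as the abstract describes. The weather-style attribute values are invented for the example; a real Bayesian network would chain many such tables along its edges.

```python
from collections import Counter, defaultdict

# Toy records as (parent, child) pairs; None marks the value to impute.
data = [
    ("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
    ("rain", "yes"), ("rain", "yes"), ("rain", None),
]

# Learn P(child | parent) from the complete records.
cpt = defaultdict(Counter)
for parent, child in data:
    if child is not None:
        cpt[parent][child] += 1

def impute(parent):
    """Return the most probable child value and its probability,
    which doubles as a confidence score for the imputation."""
    counts = cpt[parent]
    total = sum(counts.values())
    value, count = counts.most_common(1)[0]
    return value, count / total

# Fill every blank with the most probable value given its parent.
filled = [(p, c if c is not None else impute(p)[0]) for p, c in data]
```

In a Map-Reduce setting, the counting loop is the natural map/reduce step: mappers emit ((parent, child), 1) pairs over data partitions and reducers sum them into the conditional frequency tables, while the per-record imputation itself is embarrassingly parallel.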
Keywords/Search Tags: missing value imputation, Bayesian Network, probabilistic reasoning, Map-Reduce