Research On Missing Value Imputation Techniques In Complicated Applications

Posted on: 2019-08-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Ma
GTID: 1488306350971849
Subject: Computer software and theory
Abstract/Summary:
With the advent of the information age, data is being generated explosively. In many real-world applications, however, data is collected with no guarantee of quality or credibility. Missing values (MVs) occur in many real-world datasets, even when the sources of those datasets, e.g., weather data and medical data, are considered highly reliable. At the same time, many data analytics applications, e.g., machine learning, pattern matching and data mining, do not handle datasets with MVs well. MV imputation is therefore a critical preprocessing step for data mining. The traditional and commonly used MV imputation approaches can be classified into nearest neighbor imputation (NNI) and regression imputation (RI). NNI estimates the MVs in an incomplete observation from the known values of its neighbors, while RI estimates the MVs by modeling the correlations between the incomplete attributes and the complete attributes in the same observation. However, existing NNI and RI algorithms are general models designed for imputing MVs in common data. They face various limits when used to impute MVs in distinctive data, e.g., streaming data and nonlinear correlated data, and the resulting imputation accuracy is unsatisfactory. In this dissertation, we design MV imputation approaches that are specific to various kinds of distinctive data generated by complex real-world applications. The major contributions are summarized as follows.

Firstly, an order-sensitive imputation framework is proposed for imputing clustered missing values. By analyzing a number of real-world datasets, we find that missing values are prone to occur together. We refer to this phenomenon, in which missing values occur densely together, as the clustered missing values (MVs) phenomenon. Because of it, the imputation results of existing methods are unsatisfactory. To address this issue, we propose a new framework, namely Order-Sensitive Imputation for Clustered Missing values (OSICM), tailored for MV imputation under the clustered MVs phenomenon. In OSICM, we first formulate the problem of searching for the optimal imputation order, prove its NP-hardness, and introduce an algorithm based on dynamic programming to compute the optimal solution. Second, we study the hardness of approximation and devise a pseudo-linear heuristic algorithm that facilitates a trade-off between the effectiveness and efficiency of MV imputation. To further accelerate the search for a good imputation order, we also develop a linear-time greedy algorithm. Finally, we present an extensive experimental evaluation on both real and synthetic datasets. The results demonstrate that the proposed OSICM framework achieves significantly higher imputation accuracy than existing methods while scaling well.

Secondly, a new deep learning model called Missing Data Imputation denoising Autoencoders (MIDIA) is proposed for imputing MVs in nonlinear correlated data. Strong nonlinear correlations among attributes have been found in many real-world datasets, e.g., gene data and image data. Nevertheless, most existing works use a linear regression model to impute MVs by modeling linear correlations between the incomplete attributes and the complete attributes in the same observation. A linear regression model is not a reasonable tool for exploring nonlinear correlations among the attributes of the data. Motivated by the denoising autoencoder (dAE) model, we propose a new deep learning model, called MIDIA, specifically designed for MV imputation. With the MIDIA model, we aim to capture the nonlinear correlations between the incomplete attributes and the complete attributes in order to return an effective MV estimate for imputation. Moreover, MIDIA is an MV-driven model, i.e., the distribution of MVs in the training dataset should be similar to that in the testing dataset (the target dataset with MVs). For two scenarios, i.e., MVs concentrated in one or a few attributes and MVs occurring across most attributes, we propose specific MV imputation approaches, i.e., MIDIA-single and MIDIA-whole, respectively. Finally, the effectiveness of the proposed approaches is demonstrated by an extensive experimental evaluation on real-world datasets.

Thirdly, a real-time and error-tolerant missing value imputation approach, namely REMAIN, is proposed for poor-quality streaming data. In many real-world applications, data is collected with no guarantee of quality. Besides missing values, various anomalies also exist in the data. In addition, the data usually arrives sequentially and continuously, which requires real-time processing. We refer to continuously arriving data that contains missing values and a significant percentage of anomalies as poor-quality streaming data. In this part, we propose the REMAIN (Real-time and Error-tolerant Missing vAlue ImputatioN) approach, which imputes MVs in poor-quality streaming data in polynomial time and constant space. First, to account for the anomalies in the data, the MV imputation model is initialized using data free of anomalies. Second, we propose an incremental approach that updates the parameters as data continuously arrives at each time point. Additionally, for the scenario where the correlations among the attributes of the data change abruptly, we devise a deterioration detection mechanism. By estimating the variance of the imputation error at each time point, we capture the deterioration points effectively and re-estimate the model parameters at those points. Third, we propose an efficient parameter re-estimation algorithm, since the RANSAC procedure used for parameter initialization is time-consuming on large-scale datasets due to its iterative computation. Finally, we present an extensive experimental evaluation using both real and synthetic datasets. The results demonstrate that the proposed REMAIN approach achieves significantly higher imputation accuracy than existing works. Moreover, compared to state-of-the-art MV imputation methods, REMAIN with efficient parameter re-estimation obtains up to one order of magnitude improvement in scalability.

Finally, a smoothing-sensitive curve recovery algorithm that reconstructs the physical world with high precision is proposed for wireless sensor networks (WSNs). With the rapid development of sensing devices, wireless sensor networks are widely used in many real-life applications, e.g., marine environment monitoring and crop growth monitoring, to observe the complicated physical world. The data sampled by sensors is discrete, while the physical world changes continuously and smoothly. Discrete data points alone are insufficient to show the variation of the physical world, because critical data points (e.g., extreme points and inflection points) may be overlooked. It is therefore critical to construct a smooth curve from the discrete data points to describe the continuous change of the physical world. Intuitively, if we treat the discrete observations as the known points and the other points on the curve as the missing values, high-precision reconstruction of the real physical world can be considered a special case of the MV imputation problem. To address this issue, we propose a smoothing-sensitive approximate physical world curve recovery algorithm. We first construct an approximate curve from the data sampled by sensors, based on existing physical-world-aware data acquisition algorithms. Second, we propose a curve smoothing algorithm that improves the smoothness of the approximate curve from first-order continuous to second-order continuous. As a result, more inflection points can be obtained from the precise and smooth curve. Third, we present an energy-efficient data source selection scheme that takes the remaining energy of each sensor and the spatial correlations among sensors into consideration. By selecting a subset of sensors to transmit their data, the lifetime of the WSN is lengthened. Finally, the experimental results, using both real-world and synthetic datasets, demonstrate the effectiveness of the proposed algorithms.

In summary, this dissertation focuses on the problem of missing value imputation in complicated applications. For clustered MVs data, nonlinear correlated data, poor-quality streaming data and sensor data, we propose effective MV imputation solutions, respectively. Theoretical analysis and extensive experimental results show that the proposed solutions clearly outperform related works.
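
To make the two traditional families named in the abstract concrete, the following is a minimal sketch of nearest neighbor imputation (NNI) and regression imputation (RI). It is not the dissertation's code and does not implement OSICM, MIDIA or REMAIN. The function names `knn_impute` and `regression_impute`, the parameter `k`, the Euclidean distance, and the choice to learn only from fully observed rows are all assumptions made for illustration.

```python
import numpy as np


def knn_impute(X, k=2):
    """NNI sketch: fill each missing entry with the mean of that attribute
    over the k nearest complete rows (hypothetical helper, Euclidean distance
    measured on the attributes observed in the incomplete row)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    incomplete_mask = np.isnan(X).any(axis=1)
    complete = X[~incomplete_mask]
    for i in np.flatnonzero(incomplete_mask):
        known = ~np.isnan(X[i])
        dist = np.sqrt(((complete[:, known] - X[i, known]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(dist, kind="stable")[:k]]
        out[i, ~known] = nearest[:, ~known].mean(axis=0)
    return out


def regression_impute(X):
    """RI sketch: fit a linear least-squares model on the complete rows that
    predicts the missing attributes of a row from its observed attributes
    (hypothetical helper, ordinary least squares with an intercept)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    incomplete_mask = np.isnan(X).any(axis=1)
    complete = X[~incomplete_mask]
    for i in np.flatnonzero(incomplete_mask):
        known = ~np.isnan(X[i])
        # Design matrix: intercept column plus the observed attributes.
        A = np.column_stack([np.ones(len(complete)), complete[:, known]])
        coef, *_ = np.linalg.lstsq(A, complete[:, ~known], rcond=None)
        out[i, ~known] = np.concatenate([[1.0], X[i, known]]) @ coef
    return out


# Toy data (made up for this sketch): second attribute is 2 * first + 1,
# and the last observation is missing its second attribute.
X = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, 9.0], [5.0, np.nan]])
print(knn_impute(X, k=2)[4, 1])
print(regression_impute(X)[4, 1])
```

On this toy data the nearest neighbor estimate is the average of the two closest complete observations (the mean of 9 and 7), whereas the regression estimate follows the fitted line out to roughly 11. Both sketches assume a single linear or local structure over static data, which is the kind of general-purpose assumption the abstract says breaks down for clustered MVs, nonlinear correlations and streaming data.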
Keywords/Search Tags:MV imputation, clustered MVs data, nonlinear correlated data, poor-quality streaming data, physical world recovery