
Research On Missing Value Imputation Method Based On Mixed Information System

Posted on: 2021-07-20 | Degree: Master | Type: Thesis
Country: China | Candidate: C Xu | Full Text: PDF
GTID: 2518306725952349 | Subject: Software engineering
Abstract/Summary:
The completeness of a data set largely determines the quality of its data. The more values are missing, the more complex statistical analysis becomes and the fewer usable records remain. Preventive measures can reduce the missing rate, but because of various uncertain real-world factors they cannot eliminate missing data entirely, so handling missing values is an essential step. Different types of data have different characteristics, yet current research on data imputation generally does not distinguish between data types and processes all data in the same way, which hurts both the efficiency of data processing and the amount of information that can be recovered.

This thesis therefore starts from the type of the missing data. It divides 15 real data sets from the UCI repository into three categories according to their attribute characteristics: data sets with only continuous attributes, data sets with only categorical attributes, and data sets with both continuous and categorical attributes, with five data sets in each category. Values are then removed from each data set completely at random at missing rates of 5%, 10%, 20% and 30%. Eight commonly used machine learning and statistical imputation methods are compared on the resulting data: mean imputation, regression imputation, expectation-maximization (EM) imputation, multiple imputation, imputation based on fuzzy rough sets, imputation based on the support vector machine (SVM) algorithm, imputation based on the random forest algorithm, and the ROUSTIDA imputation method. Each completed data set is then classified with three classification algorithms, and the classification accuracy is used to evaluate the effect of imputation. The mean absolute error and the mean square error (MSE) are used as a second evaluation of the imputed values.

The experiments show that random forest imputation has the best overall effect on missing data sets with only continuous data, SVM-based imputation has the best effect on missing data sets with only categorical data, and the EM algorithm has the best overall effect on missing data sets with both continuous and categorical data. The optimal imputation method differs across data types and missing rates, and even within the same data type it differs from one missing rate to another. The thesis closes with a detailed summary of these results and a discussion of further research and possible extensions of the experiments.
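The evaluation loop described in the abstract (remove values completely at random, impute, then score the imputed cells) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the toy data, the function names `mask_mcar`, `impute_column_mean` and `mae_mse`, and the choice of mean imputation as the single method shown are all hypothetical, and only continuous attributes are handled.

```python
import random


def mask_mcar(rows, missing_rate, seed=0):
    """Hide a fixed share of cells, chosen uniformly at random (MCAR).

    Returns the masked copy (hidden cells are None) and the list of
    (row, column) positions that were hidden.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_missing = round(missing_rate * len(cells))
    hidden = rng.sample(cells, n_missing)
    masked = [list(row) for row in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_column_mean(masked):
    """Fill each missing cell with the mean of the observed values in its column."""
    n_cols = len(masked[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in masked if row[j] is not None]
        # Assumption: fall back to 0.0 if a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in masked]


def mae_mse(truth, imputed, hidden):
    """Mean absolute error and mean square error over the hidden cells only."""
    if not hidden:
        raise ValueError("no hidden cells to evaluate")
    errors = [imputed[i][j] - truth[i][j] for i, j in hidden]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse


# Toy continuous data set (hypothetical): 20 rows, 2 attributes.
data = [[float(i), float(2 * i + 1)] for i in range(20)]

for rate in (0.05, 0.10, 0.20, 0.30):
    masked, hidden = mask_mcar(data, rate, seed=42)
    completed = impute_column_mean(masked)
    mae, mse = mae_mse(data, completed, hidden)
    print(f"missing rate {rate:.0%}: MAE={mae:.3f}, MSE={mse:.3f}")
```

Scoring only the hidden cells keeps the error measures focused on the imputed values; swapping `impute_column_mean` for another method would leave the rest of the loop unchanged.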
Keywords/Search Tags:Hybrid information system, Missing rate, Data set with missing values, Imputation methods, Classification algorithm