
A Study On SVM Algorithm For Missing Data Processing

Posted on: 2018-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: M C Zhu
GTID: 2428330593451035
Subject: Computer Science and Technology
Abstract/Summary:
Missing data is a common problem in data analysis and data mining, and missing values in feature vectors are an important branch of it. In fields such as medicine and social surveys, the nature of data collection makes the proportion of missing data very high. Although these data sets are incomplete, they still contain a great deal of valuable information, so handling missing data and extracting that information has become an active research topic in recent years. The most common approach is imputation, in which missing values are filled with specific values during preprocessing. However, imputation is effective only when the proportion of missing data is low, and it applies only to data that are MCAR (Missing Completely at Random) or MAR (Missing at Random). In practice, data go missing for a variety of reasons, and the ideal MCAR condition almost never holds. If the reasons for missingness are ignored, imputation can distort the original distribution of the data or even lead to misleading results. This thesis focuses on missing data in medical and social survey data. After analyzing why features are missing in such data, we propose an improved support vector machine for handling missing data. The main innovation is a new kernel function that can handle both incomplete and complete data. To avoid introducing errors, the kernel function makes full use of the observed data to obtain more information: each sample is re-represented by its distances to the other samples, rather than by directly estimating the missing values. We validate our method on five data sets from UCI and compare it with traditional imputation methods, including class mean, EM, regression, KNN, and WKNN imputation, using accuracy, F-score, the Kappa statistic, and recall to evaluate performance. Experimental results show that our method achieves a significant improvement in classification results over common imputation methods, even when the proportion of missing data is high. We also improve the method by using complete data in the step before the extreme distance computation. Experimental results show that the improved algorithm performs better on continuous data.
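The idea of re-representing a sample by its distances to other samples, using only observed values, can be illustrated with a small sketch. This is not the kernel defined in the thesis, whose exact form the abstract does not give. It assumes a rescaled partial Euclidean distance over co-observed features and an RBF kernel on the resulting distance vectors. The function names, the `None` encoding for missing values, and the `gamma` parameter are all hypothetical choices made for illustration.

```python
import math

MISSING = None  # assumption: missing feature values are encoded as None


def partial_distance(x, y):
    """Euclidean distance over features observed in both x and y,
    rescaled to the full dimension (an assumed choice, not the thesis's)."""
    shared = [(a, b) for a, b in zip(x, y)
              if a is not MISSING and b is not MISSING]
    if not shared:
        raise ValueError("samples share no observed features")
    squared = sum((a - b) ** 2 for a, b in shared)
    return math.sqrt(squared * len(x) / len(shared))


def distance_representation(sample, references):
    """Re-represent a sample as its vector of distances to reference samples."""
    return [partial_distance(sample, r) for r in references]


def rbf_kernel(u, v, gamma=0.5):
    """Standard RBF kernel on two complete (no missing values) vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def gram_matrix(samples, gamma=0.5):
    """Kernel matrix computed on the distance-based representations."""
    reps = [distance_representation(s, samples) for s in samples]
    return [[rbf_kernel(u, v, gamma) for v in reps] for u in reps]


# Toy data: every pair of samples shares at least one observed feature.
samples = [
    [1.0, 2.0, None],
    [1.5, None, 0.5],
    [4.0, 5.0, 6.0],
]
K = gram_matrix(samples)
```

Because the distance vectors contain no missing entries, an ordinary kernel can be applied to them without imputing any value. A matrix such as `K` could then be passed to an SVM implementation that accepts precomputed kernels.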
Keywords/Search Tags: Missing data, SVM, Classification, Kernel function