
Research on the Classification of Imbalanced Data with Missing Values

Posted on: 2017-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: T He
Full Text: PDF
GTID: 2428330590991669
Subject: Statistics
Abstract/Summary:
Science and technology have developed rapidly in the 21st century, and computer technology in particular has made the storage and processing of massive data possible. Across all industries, the trend is to obtain more information for decision-making through data mining. In practice, researchers often encounter imbalanced data sets with missing values. In credit card fraud detection, for example, fraudulent transactions are much rarer than normal ones, and values are also easily lost during data collection, so the resulting data set is both imbalanced and incomplete. Traditional classification algorithms do not always perform well on such data because of the imbalance and the missing values.

We first describe the characteristics of imbalanced data sets with missing values and the mainstream methods for handling the related problems. We then propose improved methods for classifying such data. The main work is as follows.

First, the traditional missing-data method, the KNN imputation algorithm, suffers from sparse K nearest neighbors on multi-dimensional data sets and from instability when neighbors are weighted by the inverse of their distance. To address this, we propose a distance formula based on variable clustering to calculate the distance between samples, and then take a weighted average over the neighborhood using an exponential inverse distance formula. The resulting algorithm is FC_KNN (Feature Cluster KNN).

Second, to address the main shortcoming of under-sampling on imbalanced data, namely information loss, we propose the multi-sampling algorithm MS (Multiple Sample), which draws on the idea of the Bootstrap. We sample the majority class multiple times and combine the minority samples with each sample to form multiple training sets. We then train a Logistics_Boosting model on each training set and generate the final model through a linear combination of all the models.

Finally, we test the proposed algorithms on multiple data sets with different degrees of missing data and imbalance, and demonstrate their effectiveness.
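The imputation idea behind FC_KNN can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the distance or weighting formulas, so the grouped distance in `cluster_distance` (mean of per-group RMS differences) and the weight `exp(-d)` are hypothetical choices, and the feature groups are passed in by hand rather than produced by variable clustering. All function names are illustrative.

```python
import numpy as np

def cluster_distance(x, y, groups):
    """Hypothetical grouped distance: mean over feature groups of the
    per-group RMS difference. Features that are NaN in either sample
    are skipped. Returns inf when no feature is shared."""
    groups = np.asarray(groups)
    per_group = []
    for g in np.unique(groups):
        idx = (groups == g) & ~np.isnan(x) & ~np.isnan(y)
        if idx.any():
            per_group.append(np.sqrt(np.mean((x[idx] - y[idx]) ** 2)))
    return float(np.mean(per_group)) if per_group else np.inf

def impute_exp_knn(X, groups, k=3):
    """Fill each NaN with a weighted average of the k nearest donors.
    Weights are exp(-distance), an assumed form of 'exponential
    inverse distance' weighting."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Donors are other rows where feature j is observed.
        donors = np.array([r for r in range(len(X))
                           if r != i and not np.isnan(X[r, j])])
        if donors.size == 0:
            continue  # nothing to borrow from; leave the value missing
        # Distances use the original X, so fill order does not matter.
        d = np.array([cluster_distance(X[i], X[r], groups) for r in donors])
        nearest = np.argsort(d)[:k]
        w = np.exp(-d[nearest])
        vals = X[donors[nearest], j]
        out[i, j] = np.average(vals, weights=w) if w.sum() > 0 else vals.mean()
    return out
```

One reason a weight of this form is more stable than a plain inverse distance: `exp(-d)` stays in the range (0, 1] even when a neighbor is at distance zero, whereas `1/d` is undefined there and very large for near-duplicates.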
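The multi-sampling scheme described above can be sketched in a similar way. Several details here are assumptions rather than facts from the thesis: a plain logistic regression trained by gradient descent stands in for the Logistics_Boosting base model, each majority sample is drawn with replacement at the size of the minority class, the linear combination is an equal-weight average of predicted probabilities, and label 1 marks the minority class. Function names are hypothetical.

```python
import numpy as np

def _add_bias(X):
    # Append a constant column so the last weight acts as the intercept.
    return np.hstack([X, np.ones((len(X), 1))])

def fit_logistic(X, y, lr=0.1, steps=500):
    """Stand-in base learner: logistic regression by gradient descent."""
    Xb = _add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    return 1.0 / (1.0 + np.exp(-_add_bias(X) @ w))

def fit_multi_sample(X, y, n_models=10, seed=0):
    """Train one model per balanced training set. Each set holds all
    minority samples (label 1, by assumption) plus an equally sized
    draw from the majority class (label 0), sampled with replacement."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    models = []
    for _ in range(n_models):
        drawn = rng.choice(majority, size=len(minority), replace=True)
        idx = np.concatenate([minority, drawn])
        models.append(fit_logistic(X[idx], y[idx]))
    return models

def predict_multi_sample(models, X):
    """Equal-weight linear combination of the models' probabilities."""
    return np.mean([predict_proba(w, X) for w in models], axis=0)
```

Because every model sees a different draw from the majority class, the ensemble as a whole uses more of the majority data than a single under-sampled training set would, which is the information loss the abstract aims to reduce.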
Keywords/Search Tags: missing data, data imbalance, KNN, variable clustering, multiple sampling