Researches On The Classification Of Imbalanced Data With Missing Values

Posted on:2017-08-12

Degree:Master

Type:Thesis

Country:China

Candidate:T He

Full Text:PDF

GTID:2428330590991669

Subject:Statistics

Abstract/Summary:

Science and technology have developed rapidly in the 21st century, and computer technology stands out in particular, making the storage and processing of massive data possible. Across all walks of life, the trend is to obtain more information for decision-making through data mining. When working with data in this way, researchers often encounter imbalanced data sets with missing values. In credit card fraud scenarios, for example, the set of fraudulent actions is smaller than the set of normal ones, and values are easily lost during data collection, so the resulting data set is both imbalanced and incomplete. Traditional classification algorithms do not always perform well on such data because of the imbalance and the missing values.

We first describe the characteristics of imbalanced data sets with missing values and the mainstream methods for dealing with the related problems. We then propose improved methods for classifying such data sets. The main work is as follows.

First, the traditional missing-data method, the KNN imputation algorithm, suffers from sparsity of the K nearest neighbors on multi-dimensional data sets and from instability when the neighbors are weighted by the inverse of their distance. To address this, we propose a distance formula based on variable clustering to calculate the distance between samples, and we take a weighted average over the neighborhood using an exponential inverse distance formula. The resulting algorithm is FC_KNN (Feature Cluster KNN).

Second, to address the shortcoming of under-sampling on imbalanced data, namely information loss, we propose the multi-sampling algorithm MS (Multiple Sample), which borrows the idea of the Bootstrap. We sample the majority class multiple times and combine the minority samples with each sampled subset to form multiple training sets. We then train a Logistics_Boosting model on each training set and generate the final model through a linear combination of all the models.

Finally, we test on multiple data sets with different degrees of missing data and imbalance and demonstrate the effectiveness of the proposed algorithms.
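The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and every function name in it is hypothetical. The abstract does not give the formulas, so the sketch makes several assumptions: plain Euclidean distance over shared observed features stands in for the variable-clustering distance, the exponential weighting is taken to be exp(-d), a small logistic regression stands in for the Logistics_Boosting base model, each draw takes as many majority samples as there are minority samples, and the linear combination is an equal-weight average.

```python
import math
import random


def exp_weighted_knn_impute(rows, target, col, k=3):
    """Impute rows[target][col] from the k nearest rows where col is observed.

    Assumptions: Euclidean distance over features observed in both rows
    (stand-in for the cluster-based distance) and weights exp(-d).
    """
    def dist(a, b):
        shared = [(x, y) for j, (x, y) in enumerate(zip(a, b))
                  if j != col and x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    scored = []
    for i, r in enumerate(rows):
        if i == target or r[col] is None:
            continue
        d = dist(rows[target], r)
        if math.isfinite(d):
            scored.append((d, r[col]))
    nearest = sorted(scored)[:k]
    if not nearest:
        raise ValueError("no usable neighbors")
    # Shifting by the smallest distance leaves the normalized weights
    # unchanged and keeps the largest weight at exactly 1.
    d_min = nearest[0][0]
    weights = [math.exp(-(d - d_min)) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


def predict_proba(w, x):
    """Probability of the minority class; w[-1] is the bias term."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=200, lr=0.1):
    """Logistic regression by stochastic gradient descent (stand-in learner)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)
            for j, xj in enumerate(xi):
                w[j] += lr * err * xj
            w[-1] += lr * err
    return w


def multi_sample_ensemble(majority, minority, n_models=5, seed=0):
    """Train one model per balanced set: a fresh majority draw plus all minority."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        drawn = rng.sample(majority, len(minority))  # under-sample the majority
        X = drawn + minority
        y = [0] * len(drawn) + [1] * len(minority)
        models.append(train_logistic(X, y))
    return models


def ensemble_proba(models, x):
    """Equal-weight linear combination of the per-model probabilities."""
    return sum(predict_proba(w, x) for w in models) / len(models)
```

Unlike a 1/d weight, exp(-d) stays bounded when a neighbor sits at distance zero, which is one plausible reading of the instability the abstract mentions. Because each model sees a different draw from the majority class, the ensemble as a whole uses more of the majority data than a single under-sampled training set would.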

Keywords/Search Tags:

data missing, data imbalance, KNN, variable clustering, multiple sampling

Related items

1	Research On The Graphical Models In Intelligence Data Processing
2	Neural Network Modeling Of Imbalance Missing Data And Its Application
3	Based On Ensemble Sampling And Data Imbalance Self-adaptive Processing Method In Defect Prediction Context
4	Detection Method For Disease Based On Imbalance Data Classification Model
5	Research On Handling Missing Date Based On Statistical Learning
6	Imbalanced Learning Based On Data-Partition And Sampling Technique
7	Research On The Problem About Unbalanced Data With Balanced Sampling Method
8	Application Of Clustering Based Sampling Algorithms In Unbalanced Data Learning
9	Research On Resampling Methods For Imbalance Data
10	Research On Hybrid Sampling Algorithm Under Denoising In Imbalanced Classification