
Classification Learning Of Imbalanced Data Sets Based On Sampling Processing

Posted on: 2018-12-11    Degree: Master    Type: Thesis
Country: China    Candidate: X Shang    Full Text: PDF
GTID: 2428330548999954    Subject: Computational Mathematics
Abstract/Summary:
In the information age, data classification is an important research topic, and the classification of imbalanced data sets is particularly challenging. Imbalanced data sets are common in real life: the minority class has few samples, is sparsely distributed, and is surrounded by a large number of majority class samples, which makes it hard to classify. In practical applications, misclassifying a minority class sample often carries a higher cost, so improving classification performance on the minority class is important and deserves more attention.

At the level of data processing, over-sampling algorithms balance the class sizes by synthesizing new minority class samples. Random over-sampling increases the number of minority class samples by simple replication; it improves minority-class performance to some extent, but leads to sample overlap and over-fitting. In 2002, Chawla et al. proposed SMOTE (Synthetic Minority Over-sampling Technique). Its basic idea is: for each minority class sample, find its k nearest minority class neighbors, randomly select several of them according to the sampling rate, and synthesize new minority class samples by linear interpolation between the sample and each selected neighbor. This alleviates the overlap and over-fitting problems. However, SMOTE synthesizes new samples from all minority class samples and ignores the effect of boundary samples on classification performance. To address this, Han et al. proposed the Borderline-SMOTE algorithm, whose basic idea is to synthesize new samples only from the minority class samples lying on the class boundary, which improves the classification accuracy of the minority class. But this method selects boundary samples with the k-nearest-neighbor rule, and different choices of k lead to different boundary samples, so it has certain limitations.

This paper proposes a new method for selecting boundary samples, the DBSMOTE algorithm, together with a new rule for synthesizing minority class samples. The basic idea of DBSMOTE is: first, compute the distance between each minority class sample and the majority class samples, as well as the average of these distances; second, select as boundary samples the minority class samples whose distance is less than the average; third, synthesize new minority class samples from the boundary samples using a random rule; finally, merge the synthesized samples with the original samples into a new sample set and model the data with the k-nearest-neighbor classification algorithm. Experimental results show that the algorithm effectively improves the classification performance of the minority class.

When a data set lacks samples, both over-sampling and under-sampling have drawbacks: over-sampling can over-fit the data set, while under-sampling discards much sample information. Combining the two methods can mitigate both problems, and earlier studies of such combined sampling have reported good results. In this paper, we combine the over-sampling Random-SMOTE algorithm with an under-sampling algorithm. Experimental results show that the combined algorithm effectively improves the classification performance of the minority class.
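The SMOTE interpolation step described above can be sketched as follows. This is a minimal illustration of the technique, not the authors' implementation; the function and parameter names are assumptions.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Synthesize n_new minority samples by SMOTE-style linear interpolation.

    X_min : (n, d) array of minority class samples.
    For each synthetic sample: pick a minority sample x, pick one of its
    k nearest minority neighbors x_nn, and return x + u * (x_nn - x)
    with u drawn uniformly from [0, 1].
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise squared distances among minority samples (small-data sketch).
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    # Indices of the k nearest neighbors of each sample, excluding itself.
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    new = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)                # random minority sample
        x = X_min[j]
        x_nn = X_min[rng.choice(nn[j])]    # random one of its k neighbors
        u = rng.random()                   # interpolation coefficient in [0, 1]
        new[i] = x + u * (x_nn - x)        # linear interpolation
    return new
```

Because each synthetic point lies on the segment between a minority sample and one of its neighbors, the new samples stay inside the minority region instead of duplicating existing points, which is why SMOTE reduces the overlap and over-fitting of random over-sampling.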
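The boundary-selection step of the proposed DBSMOTE algorithm could be sketched as below. This assumes that the "distance between each minority class sample and the majority class samples" means the distance to the nearest majority class sample; that reading, and all names, are assumptions rather than the thesis's exact implementation.

```python
import numpy as np

def select_boundary(X_min, X_maj):
    """Distance-based boundary selection as described in the abstract.

    Assumption: a minority sample's distance to the majority class is the
    distance to its nearest majority class sample. Minority samples closer
    to the majority class than the average such distance are treated as
    boundary samples.
    """
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    # Distance from every minority sample to its nearest majority sample.
    d = np.sqrt(((X_min[:, None, :] - X_maj[None, :, :]) ** 2).sum(-1)).min(axis=1)
    avg = d.mean()
    # Samples nearer to the majority class than average lie near the boundary.
    return X_min[d < avg]
```

Synthetic minority samples would then be generated only from the returned boundary set, after which the augmented data set is classified with k-nearest neighbors, as the abstract describes.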
Keywords/Search Tags: Imbalanced data sets, SMOTE algorithm, Random-SMOTE algorithm, Borderline-SMOTE algorithm, combination algorithm