Research On Classification Method Of Imbalanced Data Sets

Posted on: 2020-08-02  Degree: Master  Type: Thesis
Country: China  Candidate: S L Liu
GTID: 2428330575956642  Subject: Mathematics
Abstract/Summary:
Classification plays an important role in data mining because it helps uncover valuable underlying information. However, data imbalance usually degrades classification performance: faced with imbalanced data sets, traditional algorithms have difficulty guaranteeing classification performance on the minority class. It is therefore necessary to study classification methods for imbalanced data sets. Addressing data sampling, feature selection, and the classification algorithm in turn, this thesis proposes algorithms to improve classification performance on imbalanced data sets.

(1) Sampling in data pre-processing. Suitable resampling before classification can have a positive effect on imbalanced data sets. To overcome the defect that the Borderline-SMOTE oversampling algorithm ignores the optimization of the majority class, the thesis presents the RENN-BSMOTE resampling algorithm for imbalanced data. By analysing the class distribution of each sample's nearest neighbors, samples with poor similarity to their neighbors are deleted iteratively. In addition, the Borderline-SMOTE algorithm is used to oversample the minority class. Experiments comparing RENN-BSMOTE with other algorithms illustrate its superiority.

(2) Feature selection. Proper feature selection generally benefits classification. Because Relief feature selection methods neglect the imbalanced characteristics of the data, this thesis proposes the improved DK-ReliefF algorithm. It sets different numbers of nearest neighbors for the same class and for different classes, and updates the feature weight vector according to the distances between the samples and their neighbors. Finally, the features with high discriminating ability are selected for use in classification. Comparison with other algorithms shows that DK-ReliefF effectively improves classification performance.

(3) Classification algorithm. KNN is a widely used classification algorithm. One of the problems KNN faces is how to choose an appropriate number of nearest neighbors. Building on previous work, a new PTM-DWKNN classification algorithm that selects the optimal local k value is proposed for imbalanced data sets. The optimal local k value is obtained from the local characteristics of the samples. In addition, because neighbors at different distances from the test sample have different influence, the test samples are classified by weighted voting. Finally, comparative experiments show that PTM-DWKNN classifies imbalanced data sets better.
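
The distance-weighted voting idea mentioned for the KNN classifier can be illustrated with a minimal sketch. This is not the PTM-DWKNN algorithm itself: the abstract does not give the weighting function or the rule for choosing the local k, so the sketch assumes a fixed k and inverse-distance weights. The function name `weighted_knn_predict` and the `eps` parameter are hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_X, train_y, query, k=3, eps=1e-9):
    """Classify `query` by inverse-distance weighted voting among its k nearest neighbors.

    Assumptions (not from the thesis): k is fixed rather than chosen locally,
    and each neighbor's vote is weighted by 1 / (distance + eps).
    """
    # Euclidean distance from the query to every training sample.
    dists = []
    for x, label in zip(train_X, train_y):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query)))
        dists.append((d, label))

    # Keep the k closest samples (sort on distance only, never on labels).
    neighbors = sorted(dists, key=lambda pair: pair[0])[:k]

    # Closer neighbors contribute larger votes; eps avoids division by zero.
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += 1.0 / (d + eps)

    # Return the label with the largest total weight.
    return max(votes, key=votes.get)


# Toy imbalanced data: three majority samples, one minority sample.
X = [(2.0, 1.0), (1.0, 2.0), (5.0, 5.0), (1.1, 1.0)]
y = ["majority", "majority", "majority", "minority"]

print(weighted_knn_predict(X, y, (1.0, 1.0), k=3))
```

In this toy example the three nearest neighbors of the query are one minority sample at distance about 0.1 and two majority samples at distance 1.0. An unweighted majority vote would pick the majority class, whereas inverse-distance weighting lets the much closer minority sample decide the label.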
Keywords/Search Tags: Imbalanced data set, Mixed sampling, Feature selection, Classification