Ever-increasing volumes of data are generated in industries and fields such as the Internet, healthcare, and finance, and the problem of class imbalance is widespread in these data. Traditional classification algorithms have difficulty classifying imbalanced data, and their classification accuracy is low. How to classify imbalanced data and improve classification performance has therefore become the focus of this research. Starting from both data processing and the classification algorithm, this paper improves the SMOTE algorithm and the AdaBoost algorithm to address these shortcomings:

(1) Existing oversampling algorithms ignore intra-class imbalance, do not select which samples to oversample, do not remove noise, and suffer from sample overlap and "marginalization" of the synthesized distribution. To address these problems, an improved oversampling method, AGNES-SMOTE, is proposed. Its key idea is to cluster the majority-class and minority-class samples separately with the AGNES algorithm and partition the minority class into clusters. The concepts of sampling weight and probability distribution are then introduced: the sampling weight of each minority cluster determines how many samples to synthesize in it so that the data set becomes balanced, while the probability distribution within a cluster drives a roulette-wheel selection of seed samples, each of which is combined with one of its neighbors to synthesize a new sample. During synthesis, a centroid-based mode restricts the region in which new samples are generated. Experimental results show that the proposed algorithm effectively mitigates the problems of existing oversampling algorithms and improves the classification performance of the classifier.

(2) To overcome the limitations of the AdaBoost algorithm in its weak classifiers, weighting coefficients, and sample-update strategy, the F-AdaBoost ensemble algorithm, built on AGNES-SMOTE, is proposed. The idea of the
algorithm is as follows. First, every sample in the original data set is assigned the same initial weight, and the original data set is sampled with replacement according to these weights to obtain a sampled set. Second, AGNES-SMOTE is applied to this sampled set as a second sampling step to obtain a balanced sample set, and the sample weights are redistributed; a weak classifier is then trained on the weighted balanced set. Next, the weak classifier predicts all samples in the original data set, and the classification error rate is computed from the results; the weighting coefficient of the weak classifier is derived from this error rate, and the sample weights are updated following the Focal Loss idea. After multiple rounds of iteration, the weak classifiers are combined into a strong classifier. Experiments show that the algorithm effectively overcomes the shortcomings of the AdaBoost algorithm and improves the classification performance of the classifier.

The experiments compare AGNES-SMOTE with SMOTE, Kmeans-SMOTE, Cluster-SMOTE, and other algorithms in terms of the AUC, F-measure, and G-mean values obtained. Combined with a classifier, the proposed algorithm achieves better values of these metrics on the imbalanced data sets, with good classification results. Compared with AdaBoost, SMOTEBoost, RUSBoost, and other algorithms, the Recall, Precision, and G-mean values obtained by the F-AdaBoost ensemble algorithm are all better than those of the other algorithms.
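The AGNES-SMOTE procedure described in (1) can be sketched as follows. This is a minimal illustration, not the thesis implementation: it uses scikit-learn's `AgglomerativeClustering` as a stand-in for AGNES, assumes a uniform in-cluster probability distribution for the roulette-wheel selection (the thesis derives its own distribution), and models the centroid mode as a simple pull of each synthetic sample toward its cluster centroid; the function name and the pull factor are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import NearestNeighbors

def agnes_smote(X_min, n_to_generate, n_clusters=3, k=5, rng=None):
    """Sketch of AGNES-SMOTE on the minority class X_min: cluster,
    allocate synthetic counts by a sampling weight, pick seeds by
    roulette wheel, interpolate toward a neighbour, pull to centroid."""
    rng = np.random.default_rng(rng)
    # AGNES is agglomerative (bottom-up) clustering; sklearn's
    # AgglomerativeClustering is used here as a stand-in
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_min)
    # sampling weight: smaller (sparser) clusters receive proportionally
    # more synthetic samples, countering intra-class imbalance
    sizes = np.bincount(labels, minlength=n_clusters).astype(float)
    inv = 1.0 / sizes
    per_cluster = np.round(inv / inv.sum() * n_to_generate).astype(int)
    synthetic = []
    for c in range(n_clusters):
        Xc = X_min[labels == c]
        if len(Xc) < 2 or per_cluster[c] == 0:
            continue  # cannot interpolate inside a singleton cluster
        centroid = Xc.mean(axis=0)
        # in-cluster probability distribution: uniform here (assumption)
        probs = np.full(len(Xc), 1.0 / len(Xc))
        nn = NearestNeighbors(n_neighbors=min(k, len(Xc))).fit(Xc)
        for _ in range(per_cluster[c]):
            i = rng.choice(len(Xc), p=probs)          # roulette-wheel pick
            _, idx = nn.kneighbors(Xc[i:i + 1])
            j = rng.choice(idx[0][1:])                # a non-self neighbour
            new = Xc[i] + rng.random() * (Xc[j] - Xc[i])  # SMOTE step
            # "centroid mode": pull the new point toward the cluster
            # centroid to keep it inside the cluster region (assumed form)
            new = new + rng.random() * 0.5 * (centroid - new)
            synthetic.append(new)
    return np.array(synthetic)
```

Because seeds are drawn only within minority clusters and interpolation partners are in-cluster neighbors, new samples cannot bridge two distant minority clusters, which is what avoids the overlap and "marginalization" problems of plain SMOTE.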
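The F-AdaBoost training loop in (2) can likewise be sketched under stated assumptions: labels are in {-1, +1}, decision stumps serve as the weak classifiers, the AGNES-SMOTE rebalancing of each round's sample is omitted for brevity, and the Focal-Loss-inspired modulation uses 1 - err as a crude per-sample confidence proxy (the thesis defines its own update); the function names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def focal_adaboost(X, y, n_rounds=10, gamma=2.0, rng=None):
    """Sketch of the F-AdaBoost loop for labels in {-1, +1}:
    weighted resampling, weak-learner training, AdaBoost coefficient,
    and a Focal-Loss-inspired sample-weight update."""
    rng = np.random.default_rng(rng)
    n = len(X)
    w = np.full(n, 1.0 / n)              # identical initial weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        # sample the original set with replacement, guided by the weights
        idx = rng.choice(n, size=n, replace=True, p=w)
        clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        pred = clf.predict(X)            # evaluate on the ORIGINAL set
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        if err >= 0.5:                   # no better than chance: skip
            continue
        alpha = 0.5 * np.log((1 - err) / err)   # weighting coefficient
        # Focal-Loss-inspired update (assumed proxy): treat 1 - err as the
        # confidence of correct predictions, so hard (misclassified)
        # samples receive an extra (1 - p)^gamma boost
        p = np.where(pred == y, 1 - err, err)
        w = w * (1 - p) ** gamma * np.exp(-alpha * y * pred)
        w = w / w.sum()
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas

def fada_predict(learners, alphas, X):
    # strong classifier: sign of the alpha-weighted vote of weak learners
    score = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(score)
```

Relative to the plain AdaBoost update `w * exp(-alpha * y * pred)`, the extra `(1 - p) ** gamma` factor shrinks the weights of easy, confidently classified samples much faster, so later rounds concentrate on the hard (often minority-class) samples, which is the intent of the Focal Loss idea.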