In recent years, machine learning, as an important branch of artificial intelligence, has become increasingly familiar to the public. Faced with the explosive growth of data, manual processing and analysis is utterly impractical, and machine learning has become an irreversible trend. Imbalanced data classification, a hot topic in machine learning, is widely applied in the real world and plays an important role in scenarios such as disease diagnosis, credit card fraud detection, and network intrusion detection. In imbalanced classification, traditional classifiers aim to maximize overall classification accuracy, which does not truly reflect classifier performance: accuracy on the minority classes is often poor. How to improve minority-class classification has therefore remained a hot issue, and many scholars have proposed effective strategies. Based on an analysis of the domestic and foreign literature, this paper argues that classification accuracy can be improved by improving the under-sampling method.

Under-sampling, which balances the classes by reducing the number of majority-class samples through strategies such as random sampling, is a classical remedy for imbalanced data classification. However, because it discards a large amount of data, it enlarges the relative proportion of noise samples in the majority class and alters the data distribution, so noise samples degrade the method in practice. To address the noise samples that persist in under-sampling algorithms, this paper proposes an imbalanced data classification algorithm based on clustering-ensemble under-sampling. The algorithm removes noise samples from the majority class by combining a clustering ensemble with Isolation Forest, thereby improving sample quality, before proceeding to classification of the minority class. This paper then combines the algorithm with XGBoost to obtain the RUIF-XG algorithm to deal with
imbalanced classification. Experiments on seven data sets from UCI and KEEL show that, compared with other under-sampling algorithms and direct classification methods, the proposed algorithm improves both the AUC value and the F1 value to a certain extent when solving imbalanced classification problems. To make the experimental conclusions more convincing, this paper applies the Wilcoxon signed-rank test to the experimental results, which confirms that the improvements in the evaluation metrics are statistically significant. Finally, this paper uses the algorithm to improve prediction performance on protein subcellular localization: compared with direct classification, the algorithm increases the AUC and F1 values by 11.83% and 6.97%, and compared with the second-best under-sampling algorithm, by 2.72% and 2.54%.

At the same time, some shortcomings remain in this paper. On the one hand, using Isolation Forest, an anomaly detection algorithm, imposes requirements on the size of the sub-sample set and on the definition of noise. On the other hand, some parameters of the algorithm, such as the number of clustering runs and the proportion of noise to delete, are difficult to optimize analytically and can only be set from empirical data. Future research can start from these aspects to perfect the theory of the model and make its conclusions more scientific and effective.
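The noise-filtering step described above can be sketched in code. This is only a minimal illustration of the idea of combining repeated clustering with Isolation Forest to vote out noisy majority-class samples; the function name `ensemble_undersample`, the voting scheme, and all parameter values are illustrative assumptions, not the thesis's exact RUIF-XG procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

def ensemble_undersample(X_major, n_keep, n_clusterings=5, contamination=0.1, seed=0):
    """Illustrative sketch (not the thesis's exact algorithm):
    run several clusterings of the majority class, apply Isolation Forest
    within each cluster, accumulate anomaly votes, and keep the n_keep
    samples with the fewest votes."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_major))
    for i in range(n_clusterings):
        # vary the number of clusters across runs to form a clustering ensemble
        k = int(rng.integers(2, 6))
        labels = KMeans(n_clusters=k, n_init=10, random_state=i).fit_predict(X_major)
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            if len(idx) < 8:          # skip clusters too small to score reliably
                continue
            iso = IsolationForest(contamination=contamination, random_state=i)
            pred = iso.fit_predict(X_major[idx])   # -1 marks an anomaly
            votes[idx] += (pred == -1)
    keep = np.argsort(votes)[:n_keep]  # least-noisy samples survive
    return X_major[keep]
```

After filtering, the retained majority samples would be merged with the minority class and passed to a boosted-tree classifier such as XGBoost, as the thesis does.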
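The significance check mentioned above is a standard paired test. The sketch below shows how the Wilcoxon signed-rank test would be applied to per-dataset scores of two methods; the seven AUC values are made up for illustration and are not the thesis's experimental results.

```python
from scipy.stats import wilcoxon

# Hypothetical paired AUC scores on seven data sets (illustrative only):
auc_proposed = [0.91, 0.88, 0.93, 0.86, 0.90, 0.89, 0.92]
auc_baseline = [0.87, 0.85, 0.90, 0.84, 0.88, 0.86, 0.89]

# Two-sided Wilcoxon signed-rank test on the paired differences
stat, p = wilcoxon(auc_proposed, auc_baseline)
significant = p < 0.05  # reject equal medians at the 5% level
```

A small p-value here supports the claim that the improvement is consistent across data sets rather than due to chance on one or two of them.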