With the development of society, imbalanced data classification problems arise widely in daily life and have become a hotspot in the field of data mining. In imbalanced data, the minority class, which accounts for a low proportion of instances, is often the one that needs to be identified. However, traditional machine learning classification algorithms tend to be biased toward the majority class, which accounts for a high proportion of instances, and therefore have difficulty distinguishing minority instances accurately. At present, solutions for imbalanced data classification are roughly divided into data-level approaches (oversampling and undersampling) and algorithm-level approaches (cost-sensitive learning and ensemble learning). This paper studies oversampling methods for imbalanced data classification and applies the proposed methods to pesticide residue testing. The specific contents are as follows:

For imbalanced data with overlaps between classes, the Synthetic Minority Over-sampling Technique (SMOTE) and several improved oversampling methods based on it cannot avoid synthesizing new instances that overlap with original majority instances, which easily causes the classification model to overfit. To solve this problem, this paper proposes the Clustering-based Improved Adaptive Synthetic Minority Oversampling Technique (CI-ASMOTE). It uses the Euclidean Distance Clustering Algorithm to group connected minority instances into sub-clusters of different sizes, and then selects semi-safe minority sub-clusters that are relatively close to the decision boundary as candidate sub-clusters. Finally, it uses SMOTE to adaptively synthesize more instances in those candidate sub-clusters that are sparser and closer to the majority class. According to analyses of the experimental results of CI-ASMOTE and five other oversampling methods used for comparison on seven imbalanced datasets with overlaps
between classes, CI-ASMOTE is more effective at improving the recognition rate of minority instances in imbalanced data and at avoiding overfitting.

To address the shortcoming of CI-ASMOTE that synthetic instances within the same minority sub-cluster lack diversity, this paper proposes the Cluster-based Improved Adaptive Two-Step Synthetic Minority Oversampling Technique (CI-ATS). It adopts Two-step SMOTE (TSMOTE) instead of SMOTE to synthesize instances within candidate minority sub-clusters, which improves the diversity of new instances within the same sub-cluster. Because the randomness of the minority sub-clusters' centers makes the performance of CI-ATS unstable, and inspired by ensemble learning methods, CI-ATS is further combined with the AdaBoost ensemble learning method, yielding the Cluster-based Improved Adaptive Two-Step Synthetic Minority Oversampling Ensemble Algorithm (CI-ATSE). CI-ATS, CI-ATSE, CI-ASMOTE and five oversampling methods are compared on ten real imbalanced datasets. The experimental results show that, compared with CI-ASMOTE and the five oversampling methods, CI-ATS and CI-ATSE improve the classification results on imbalanced data more significantly, and that CI-ATSE is superior to and more stable than CI-ATS.

The methods for imbalanced data classification proposed in this paper are applied to pesticide residue testing, and three classification models based on CI-ASMOTE, CI-ATS and CI-ATSE are established. Classification results on self-tested near-infrared spectral data of Chinese cabbage verify the effectiveness of the three proposed methods on a real imbalanced data classification problem, with the CI-ATSE-based model performing best.
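
To make the core idea more concrete, the following is a minimal Python sketch of two building blocks mentioned above: grouping connected minority instances into sub-clusters by Euclidean distance, and standard SMOTE interpolation restricted to a single sub-cluster. It is an illustration under stated assumptions, not the implementation used in this paper. The function names, the `radius` linking threshold, and the connected-component grouping are hypothetical stand-ins for the Euclidean Distance Clustering Algorithm. The selection of semi-safe sub-clusters, the adaptive allocation of synthetic instances, TSMOTE, and the AdaBoost combination are not shown.

```python
import numpy as np


def euclidean_clusters(X, radius):
    """Label connected components of X, linking two points when their
    Euclidean distance is at most `radius` (hypothetical parameter)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to a sub-cluster
        labels[start] = current
        stack = [start]
        while stack:  # depth-first expansion of one component
            i = stack.pop()
            for j in np.flatnonzero(dist[i] <= radius):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def smote_within_cluster(cluster, n_new, k=5, rng=None):
    """Standard SMOTE interpolation, with neighbours drawn only from
    the given minority sub-cluster (assumed restriction)."""
    rng = np.random.default_rng(rng)
    m = len(cluster)
    if m < 2:
        return np.empty((0, cluster.shape[1]))  # nothing to interpolate
    k = min(k, m - 1)
    dist = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, m, size=n_new)   # seed instance per new point
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbours
    nn = neighbours[base, pick]
    lam = rng.random((n_new, 1))            # interpolation weight in [0, 1)
    return cluster[base] + lam * (cluster[nn] - cluster[base])


# Toy minority set with two well-separated groups.
minority = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
)
labels = euclidean_clusters(minority, radius=0.5)
synthetic = smote_within_cluster(minority[labels == 0], n_new=4, k=2, rng=0)
```

Because every synthetic point lies on a segment between two instances of the same sub-cluster, new instances stay inside that sub-cluster's region instead of bridging distant minority groups across majority territory.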