In the fields of machine learning and data mining, a lack of minority-class samples may cause the decision boundary to be biased towards the majority class, resulting in degraded classifier performance. In the past two decades, many algorithms have been proposed to address data imbalance, among which oversampling algorithms are widely popular due to their simplicity. Although researchers have done a great deal of work on oversampling, existing oversampling algorithms still have their own defects. Building on a study of existing oversampling algorithms, this thesis proposes three effective oversampling algorithms. The main contributions of this thesis are as follows:

(1) This thesis proposes a δ-Neighborhood-Denoising-based Oversampling Technique (δ-NSMOTE) for imbalanced learning. Oversampling algorithms generally filter minority-class noisy samples with a Euclidean-distance denoising (EDD) strategy. However, if minority-class noisy samples are located in the majority-class region and are close to each other, the EDD strategy fails. To remedy this, this thesis designs a Chebyshev-distance-based δ-neighborhood denoising strategy and then proposes δ-NSMOTE. In addition, δ-NSMOTE introduces the concept of relative density to calculate the number of new samples that should be generated around each original sample. On both artificial and real-world datasets, we validate that δ-NSMOTE is superior to other related algorithms in terms of noise filtering and density measurement.

(2) This thesis proposes a Minority-Prediction-Probability-based Oversampling Technique (MPPOT) for imbalanced learning. To address the problem that the distribution of synthetic data generated by existing algorithms is not consistent with that of the original minority-class samples, this thesis applies the idea of divide and conquer to propose MPPOT. Predicting the class probability of minority-class samples divides these samples into two types (hard-to-learn and
easy-to-learn), and the divide-and-conquer strategy then applies different density measurements and sample-generation schemes to the two types. Consequently, the distribution of samples newly generated by MPPOT is largely consistent with that of the original minority-class samples. Experimental results on artificial and real-world datasets show that MPPOT achieves better data-distribution consistency.

(3) This thesis proposes a Clustering-Fusion-based Oversampling Technique (CFOT) for imbalanced learning. MPPOT does not take into account the closeness density of subclusters within the minority class, and may therefore generate new noisy samples across subclusters. To solve this issue, this thesis proposes CFOT on the basis of MPPOT. CFOT adopts a clustering strategy to distinguish the closeness density of subclusters, so that samples in subclusters with high closeness density can be selected to generate new samples, further improving oversampling performance. Experimental results show that CFOT generates better samples among multiple subclusters and achieves better classification performance.
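The Chebyshev-distance δ-neighborhood denoising idea behind δ-NSMOTE can be illustrated with a minimal sketch. Everything below — the `delta` radius, the majority-dominance rule, and the function name `chebyshev_delta_denoise` — is an assumption for illustration, not the thesis's exact formulation: a minority sample is flagged as noise when majority-class samples dominate its Chebyshev δ-neighborhood, a criterion that still catches noisy minority samples sitting close to one another inside the majority region (the case where Euclidean-distance denoising fails).

```python
import numpy as np

def chebyshev_delta_denoise(X_min, X_maj, delta=0.5, purity=0.5):
    """Flag minority samples as noise when their Chebyshev
    delta-neighborhood is dominated by majority samples.

    Illustrative sketch only: `delta` and the dominance threshold
    `purity` are assumptions, not the thesis's exact formulation.
    Returns a boolean mask of minority samples to keep.
    """
    keep = []
    for x in X_min:
        # Chebyshev (L-infinity) distance from x to every sample
        d_min = np.max(np.abs(X_min - x), axis=1)
        d_maj = np.max(np.abs(X_maj - x), axis=1)
        n_min = np.sum(d_min <= delta) - 1  # exclude the sample itself
        n_maj = np.sum(d_maj <= delta)
        total = n_min + n_maj
        # Noise if the delta-neighborhood is mostly majority-class
        is_noise = total > 0 and n_maj / total > purity
        keep.append(not is_noise)
    return np.asarray(keep)
```

Because the neighborhood test counts both classes inside a fixed box, a tight pair of minority noisy points deep in the majority region is still outnumbered by its majority neighbors and gets filtered.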
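The relative-density idea — deciding how many synthetic samples to generate around each original minority sample — can be sketched as follows. The proxy used here (a sample's weight is its mean distance to its k nearest minority neighbors, so sparser regions receive more new samples) and the function name are illustrative assumptions; the thesis's exact relative-density definition is not reproduced.

```python
import numpy as np

def allocate_by_relative_density(X_min, n_new, k=3):
    """Distribute n_new synthetic samples over minority samples,
    giving sparser regions more new samples.

    Illustrative sketch: the inverse-density weight below is an
    assumed proxy, not the thesis's exact relative-density measure.
    """
    # pairwise Euclidean distances among minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # ignore self-distance
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)
    # larger mean k-NN distance = sparser region = larger weight
    w = knn_mean / knn_mean.sum()
    counts = np.floor(w * n_new).astype(int)
    # hand the rounding remainder to the sparsest samples first
    rem = n_new - counts.sum()
    order = np.argsort(-w)
    counts[order[:rem]] += 1
    return counts
```

The per-sample counts always sum to `n_new`, and isolated minority samples receive the largest share.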
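CFOT's motivation — interpolating only between minority samples that share a subcluster, so no synthetic sample lands in the empty space between subclusters — can be sketched as below. The subcluster labels are taken as given (e.g., from any clustering step); the thesis's closeness-density-based selection of subclusters is not reproduced, and the function name is hypothetical.

```python
import numpy as np

def within_cluster_smote(X_min, labels, n_new, rng=None):
    """Generate synthetic samples by linear interpolation only
    between minority samples in the same subcluster.

    Illustrative sketch: `labels` are assumed subcluster assignments;
    CFOT's closeness-density selection is not modeled here.
    """
    rng = np.random.default_rng(rng)
    new = []
    for _ in range(n_new):
        # pick a subcluster, then two distinct samples inside it
        c = rng.choice(np.unique(labels))
        idx = np.flatnonzero(labels == c)
        if len(idx) < 2:
            continue                      # singletons generate nothing
        a, b = rng.choice(idx, size=2, replace=False)
        t = rng.random()
        new.append(X_min[a] + t * (X_min[b] - X_min[a]))
    return np.array(new)
```

Restricting interpolation pairs to one subcluster is what prevents the cross-subcluster noisy samples that plain SMOTE-style interpolation can produce.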