| In the information age, data are produced continuously. These data must be analyzed to extract their value, and data mining technology emerged to meet that need. Machine learning is currently one of the most popular and effective techniques in data mining, and classification is one of its important research branches. There are many traditional classification methods, such as decision trees, support vector machines, and neural networks. However, these methods are suited to data sets whose classes are relatively balanced, whereas data in practical applications often show imbalance between classes or within classes; we call such data imbalanced data sets. When applied to imbalanced data, a traditional classification model tends to be dominated by the majority class, which makes it unsuitable for real applications that care more about the minority class. Improving classification performance on minority samples is therefore an urgent problem. Current research on imbalanced data classification works at two levels: the algorithm level and the data set level. The cost-sensitive method is a representative method at the algorithm level. It assigns different weights to the samples so that the sum of the weights of the minority samples equals the sum of the weights of the majority samples. Methods at the data set level are mainly sampling methods, including over-sampling and under-sampling. Over-sampling processes the minority samples and increases their number, which reduces the imbalance ratio between classes and improves classification accuracy. Under-sampling processes the majority samples and reduces their number, which likewise reduces the imbalance ratio between classes and improves classification accuracy. This dissertation works at the data set level and uses the spatial distribution information of the samples to improve classification performance on imbalanced data sets. The main research contents are as follows: (1) A 
neighborhood-aware oversampling method for imbalanced data sets (NA-SMOTE) is proposed. It learns the local neighborhood information of the minority samples and uses that information to constrain sample synthesis during oversampling, which reduces the possible unfavorable effects of linear interpolation, such as noise and sample overlap, and thereby improves the efficiency of oversampling. First, neighborhood information mining is performed several times on the minority samples in the data set, yielding the neighborhood information of the minority samples after each round. The neighborhood information from all rounds is then fused to obtain the irregular local neighborhoods of the minority samples. Finally, the SMOTE method is used to synthesize new samples within these local neighborhoods. The distribution of noise samples is detected during neighborhood information mining, so that the synthesis of noise samples can be effectively avoided when new samples are generated. Experiments on typical imbalanced data sets show that using the neighborhood information of minority samples as a constraint can effectively improve oversampling efficiency. (2) A neighborhood-aware weighted undersampling method for imbalanced data sets (WUS) and a neighborhood-aware ensemble weighted undersampling method (WEUS-V) are proposed. They learn the local neighborhood information of the majority samples, define a spatial information distribution index for the majority samples from that information, and use this index as the weight in weighted undersampling to reduce the sampling probability of majority samples in the overlapping area, thereby improving the efficiency of undersampling. Neighborhood information mining is performed several times on the majority samples in the data set, yielding the neighborhood information of the majority samples after each round; a function then computes the spatial information distribution index from this information, and finally use the spatial information 
distribution index of each majority sample as its sampling weight. The required majority samples are then selected by weighted random sampling and combined with the minority samples to form a balanced data set. Experiments on 100 imbalanced data sets show that using the spatial information distribution index of the majority samples as the weight in weighted undersampling can effectively improve undersampling efficiency. |
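
The weighted undersampling step described in (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the WUS method itself: the abstract does not give the formula for the spatial information distribution index, so the sketch substitutes a hypothetical stand-in weight (the fraction of a majority sample's k nearest neighbors that share its class, plus a small floor `eps`). The function names, the parameters `k` and `eps`, and the toy data are likewise assumptions made for illustration only.

```python
import math
import random


def knn_same_class_fraction(points, labels, idx, k):
    """Fraction of the k nearest neighbors of points[idx] that share its label.

    Hypothetical stand-in for the spatial information distribution index:
    values near 0 suggest the sample lies in an overlapping region.
    """
    dists = sorted(
        (math.dist(points[idx], p), j)
        for j, p in enumerate(points)
        if j != idx
    )
    neighbors = [j for _, j in dists[:k]]
    if not neighbors:
        return 1.0
    same = sum(1 for j in neighbors if labels[j] == labels[idx])
    return same / len(neighbors)


def weighted_undersample(points, labels, majority, k=3, eps=0.05, seed=0):
    """Return indices of a balanced subset: sampled majority plus all minority.

    Majority samples are drawn without replacement with probability tied to
    their weight, so samples in overlapping regions are picked less often.
    """
    rng = random.Random(seed)
    maj = [i for i, y in enumerate(labels) if y == majority]
    mino = [i for i, y in enumerate(labels) if y != majority]
    # eps keeps every weight positive so that no sample is excluded outright.
    weights = {i: knn_same_class_fraction(points, labels, i, k) + eps for i in maj}
    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # draw u in [0, 1), rank by u ** (1 / w), and keep the largest keys.
    ranked = sorted(maj, key=lambda i: rng.random() ** (1.0 / weights[i]), reverse=True)
    kept = sorted(ranked[: len(mino)])
    return kept + mino


# Toy data: six majority samples near the origin, one majority sample (index 6)
# sitting inside the minority cluster, and three minority samples.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0, 0.5), (5, 5.1),
          (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

balanced = weighted_undersample(points, labels, majority=0)
```

With this stand-in weight, the majority sample at index 6 receives only the floor weight `eps`, because all of its nearest neighbors are minority samples, so it is much less likely to be retained than the samples deep inside the majority cluster. Any other index that scores overlap could be substituted for `knn_same_class_fraction` without changing the sampling step.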