| Sampling theory underpins many disciplines. Sampling methods can yield approximate solutions when exact inference is impractical, and they can also accelerate computation. In the era of big data, sampling methods are used more widely than ever. Data imbalance is a classic problem in machine learning, and data-level solutions to it include oversampling and undersampling. Data-driven models encounter imbalanced data even more often when applied to problems in real-time scenarios. The main contributions of this paper are as follows. First, this paper studies how data imbalance affects the evaluation of classification models in machine learning, and visualized experiments show that data imbalance degrades model learning. For the two data-level approaches to imbalanced learning, oversampling and undersampling, the validity of sampling is established through theoretical analysis and comparative experiments, which lays a theoretical and experimental foundation for the subsequent work. Second, earlier research introduced evolutionary ideas into sampling algorithms and proposed related algorithms that incorporate an adaptive Lévy distribution. This paper improves the evolutionary sampling algorithm based on the Lévy distribution: setting the distribution's parameter α to 1.0, 1.3, 1.7, and 2.0 yields four transition probability distributions, which increases the diversity of the generated candidate samples. Theoretical analysis and experimental results show that the proposed algorithm outperforms evolutionary sampling algorithms based on the Gaussian, Cauchy, and symmetric exponential distributions, as well as other adaptive evolutionary sampling algorithms, in both convergence rate and accuracy. Third, for oversampling on imbalanced datasets, an analysis of the Lévy-distribution-based sampling method shows that the sampling-rate generation function need not be a Lévy distribution; data sampling methods based on the Gaussian distribution and on a piecewise distribution are therefore proposed. New samples synthesized from borderline samples have the highest density, those synthesized from samples closer to the majority class have the second highest, and those synthesized from samples closer to the minority class have the lowest. This approach thus strengthens the decision boundary and reduces the generation of noise. Experiments on multiple datasets show that the proposed approach effectively improves classification results on imbalanced datasets. |
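
To make the role of the parameter α concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes that "Lévy distribution" here refers to the symmetric α-stable family, for which α = 1 gives the Cauchy distribution and α = 2 gives a Gaussian. The function names `symmetric_stable` and `candidate_moves`, the `scale` parameter, and the choice of producing one candidate per α value are all illustrative assumptions. Samples are drawn with the Chambers-Mallows-Stuck method.

```python
import math
import random

# The four alpha values named in the abstract.
ALPHAS = (1.0, 1.3, 1.7, 2.0)


def symmetric_stable(alpha, rng):
    """Draw one unit-scale symmetric alpha-stable variate
    using the Chambers-Mallows-Stuck method."""
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    u = rng.uniform(-math.pi / 2, math.pi / 2)  # uniform angle
    w = rng.expovariate(1.0)                    # unit-rate exponential
    if alpha == 1.0:
        return math.tan(u)                      # Cauchy special case
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def candidate_moves(x, scale, rng):
    """Hypothetical proposal step: one candidate per alpha value.

    Smaller alpha means heavier tails, so those candidates occasionally
    jump far from x; alpha = 2 yields purely Gaussian, local moves.
    """
    return [x + scale * symmetric_stable(a, rng) for a in ALPHAS]


if __name__ == "__main__":
    rng = random.Random(0)
    for a, c in zip(ALPHAS, candidate_moves(0.0, 1.0, rng)):
        print(f"alpha={a}: candidate={c:.4f}")
```

Mixing several α values in this way combines occasional long jumps (small α) with local refinement (α near 2), which is one plausible reading of how the four transition distributions increase candidate diversity. How the paper actually selects among or adapts the four distributions is not stated in the abstract.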