In the current era of rapid data growth, the scale of data increases exponentially; data has become indispensable in people's lives and profoundly affects them. Accurately and rapidly extracting, identifying, analyzing, and classifying the important information in data, and efficiently mining the information that is valuable to us, has therefore become an important research topic. Taking machine learning as their main tool, scholars have proposed many mature models after years of research, but most of these models are designed for balanced data samples. As the application fields of data keep expanding, however, many datasets exhibit an imbalanced structure, for example credit card fraud data, network attack identification, medical and health prediction, and fault diagnosis; such data are usually called imbalanced data. The imbalanced distribution of samples places stricter requirements on data mining: when traditional classification algorithms are applied to imbalanced data, minority-class samples are easily misclassified and the recognition rate of the minority class drops, yet traditional classifiers and traditional evaluation indicators can still report seemingly good performance. When analyzing imbalanced data, it is therefore necessary to focus on how accurately the model handles the minority-class samples, so as to reduce their misclassification rate and improve the overall classification accuracy.

This thesis first analyzes the background and significance of imbalanced data classification and reviews the relevant theory in light of the research status at home and abroad. There are two general ways to handle imbalance: processing at the data level and processing at the algorithm level. At the data level, oversampling techniques are mainly used to balance the data; at the algorithm level, traditional classification algorithms are optimized and improved. The evaluation indicators relevant to imbalanced data classification, such as F-measure, Kappa, AUC, and G-mean, are then analyzed.

To address the problems that the traditional sampling process is easily disturbed by noisy sample points (outliers) and that the sample distribution in a dataset blurs the decision boundary, this thesis proposes an adaptive oversampling algorithm based on a probability density function. First, the minority-class samples are divided into safe, boundary, and noise samples according to their distribution; then the Rayleigh distribution is used to sample around the safe and boundary samples, and its probability density function shapes the distribution density of the synthetic samples, so that the decision boundary is clarified while the dataset is balanced. Random forest is used as the classifier, and the proposed algorithm is compared with a variety of representative algorithms on multiple imbalanced datasets, verifying its effectiveness.

To address the influence of noisy sample points and of manually set parameters in traditional oversampling algorithms, a SMOTE oversampling algorithm based on natural nearest neighbors is proposed. First, the natural neighbors of each sample point are identified; then the SMOTE algorithm is applied to these natural-neighbor sample points, using the properties of natural neighbors to identify noisy sample points and to reduce the dependence on external prior knowledge, so that the dataset is balanced. Random forest is again used as the classifier and further improved; in the final experiments, the algorithm is compared with the SMOTE algorithm and its optimized variants on 10 imbalanced datasets, verifying its effectiveness.
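For reference, the evaluation indicators listed above can be expressed in terms of the confusion-matrix counts TP, FP, TN, and FN; the formulas below are the standard textbook definitions, summarized here rather than quoted from the thesis.

```latex
P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}, \quad
\text{F-measure} = \frac{2PR}{P+R}, \quad
\text{G-mean} = \sqrt{\frac{TP}{TP+FN}\cdot\frac{TN}{TN+FP}}, \quad
\kappa = \frac{p_o - p_e}{1 - p_e}
```

Here \(p_o\) is the observed agreement and \(p_e\) the agreement expected by chance; AUC is the area under the ROC curve and has no single closed form in these counts.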
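As an illustrative sketch of the first proposed method, partitioning minority samples into safe, boundary, and noise points and then generating synthetic samples whose offsets follow a Rayleigh distribution, the Python fragment below shows one possible realization. It is not the thesis implementation, and names and parameter values such as `partition_minority`, `k`, and `sigma` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def partition_minority(X_min, X_maj, k=5):
    """Label each minority sample 'safe', 'boundary' or 'noise' according to
    how many of its k nearest neighbours belong to the majority class."""
    X_all = np.vstack([X_min, X_maj])
    y_all = np.hstack([np.ones(len(X_min)), np.zeros(len(X_maj))])
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn.kneighbors(X_min)
    labels = []
    for neigh in idx[:, 1:]:                      # drop the query point itself
        n_maj = int(np.sum(y_all[neigh] == 0))
        if n_maj == k:
            labels.append("noise")                # surrounded by the majority class
        elif n_maj >= k // 2:
            labels.append("boundary")             # lies near the decision boundary
        else:
            labels.append("safe")
    return np.array(labels)

def rayleigh_oversample(X_min, labels, n_new, sigma=0.3, random_state=0):
    """Create synthetic minority samples around safe/boundary seeds with
    offset lengths drawn from a Rayleigh distribution."""
    rng = np.random.default_rng(random_state)
    seeds = X_min[labels != "noise"]              # noise points never act as seeds
    synthetic = []
    for _ in range(n_new):
        seed = seeds[rng.integers(len(seeds))]
        direction = rng.normal(size=seed.shape)
        direction /= np.linalg.norm(direction)    # random unit direction
        radius = rng.rayleigh(scale=sigma)        # Rayleigh-distributed step length
        synthetic.append(seed + radius * direction)
    return np.vstack(synthetic)
```

In practice the synthetic samples would then be stacked onto the original minority class and a random forest fitted on the rebalanced training set.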
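The second proposed method can be sketched in a similar spirit: the neighborhood size is grown until the set of points that no other point selects as a neighbor stops shrinking, so no neighborhood parameter has to be supplied by hand; points that end up with no mutual neighbors are treated as noise, and SMOTE-style interpolation is restricted to natural neighbors. The fragment below is a simplified illustration under these assumptions, not the thesis code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbors(X, max_r=20):
    """Grow the neighbourhood size r until the number of points that nobody
    selects as a neighbour stops shrinking; isolated points become noise."""
    n = len(X)
    nn = NearestNeighbors(n_neighbors=min(max_r + 1, n)).fit(X)
    _, idx = nn.kneighbors(X)
    prev_orphans = n + 1
    for r in range(1, min(max_r, n - 1) + 1):
        in_degree = np.zeros(n, dtype=int)
        for i in range(n):
            for j in idx[i, 1:r + 1]:
                in_degree[j] += 1
        orphans = int(np.sum(in_degree == 0))
        if orphans == 0 or orphans == prev_orphans:   # search has stabilised
            break
        prev_orphans = orphans
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in idx[i, 1:r + 1]:
            if i in idx[j, 1:r + 1]:                  # mutual (natural) neighbours
                neighbours[i].add(int(j))
    noise = {i for i in range(n) if not neighbours[i]}
    return neighbours, noise

def nan_smote(X_min, n_new, random_state=0):
    """SMOTE-style interpolation restricted to natural neighbours."""
    rng = np.random.default_rng(random_state)
    neighbours, noise = natural_neighbors(X_min)
    seeds = [i for i in range(len(X_min)) if i not in noise]
    synthetic = []
    for _ in range(n_new):
        i = seeds[rng.integers(len(seeds))]
        j = rng.choice(list(neighbours[i]))           # one natural neighbour of i
        gap = rng.random()                            # interpolation coefficient in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```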
At the algorithm level, the random forest algorithm itself is improved: the node-splitting rule is optimized by combining information gain with the Gini index, and the grid search method is used for parameter optimization, selecting an appropriate number of decision trees and an appropriate number of attributes examined at each node, in order to improve the performance of the random forest algorithm in imbalanced data classification.

In summary, this thesis works from three aspects: optimizing the decision boundary, identifying noisy sample points, and reducing the influence of human prior knowledge. Optimizing the decision boundary makes the boundary between minority-class and majority-class samples relatively clear, which effectively improves the classification performance of the model; identifying noisy sample points reduces the influence of their distribution on the decision boundary when new samples are synthesized, and thus reduces boundary blurring; the natural neighbor algorithm reduces the influence of human prior knowledge while identifying noisy sample points. The effectiveness of the above algorithms is demonstrated by experiments.

Finally, sand and dust storm data for parts of Gansu are extracted from the "China Severe Dust Storm Sequence and Its Supporting Dataset" and "China's Daily Value Dataset of Surface Climate Data", and the proposed methods are used to build a model for the imbalanced data classification problem of regional sandstorms.
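Returning to the algorithm-level improvement, the parameter-tuning step described above, selecting the number of trees and the number of attributes examined at each split by grid search, can be sketched with scikit-learn's GridSearchCV. The grid values and scoring metric below are illustrative assumptions, and the combined information-gain/Gini splitting rule is a modification of the tree itself that standard scikit-learn does not expose, so it is not shown here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values for the number of trees and for the number of
# attributes examined at each node split (hypothetical grid).
param_grid = {
    "n_estimators": [50, 100, 200, 400],
    "max_features": ["sqrt", "log2", 0.3, 0.5],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",   # minority-oriented metric; G-mean would need a custom scorer
    cv=5,
)
# search.fit(X_train, y_train)    # X_train, y_train: the rebalanced training set
# print(search.best_params_)
```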