| Big data is developing rapidly, and the importance of data grows by the day. How to extract more useful information from data has become a focus of research, which has made the field of data mining increasingly prominent. Within data mining, data classification is a research hotspot. Traditional theoretical research on classification often assumes that data sets are evenly distributed, but many real-world data sets are not, and classification performance on such data is often poor. In an unbalanced data set, the minority samples are usually the more important ones: the loss caused by misclassifying a minority sample is often much higher than that caused by misclassifying a majority sample. The main task in unbalanced data classification is therefore to improve the recognition rate of minority samples while preserving the recognition rate of majority samples. This thesis studies the problem at the data level and the algorithm level, proposes an improved method at each level, and uses F1, AUC and G-mean to evaluate the methods from multiple perspectives. At the data level, the thesis addresses some shortcomings of traditional sampling methods by proposing a hybrid sampling method based on distance and density weights. The hybrid sampling method pays closer attention to the position of each sample, giving more weight to samples that lie closer to the class boundary and to samples in sparser regions. It generates an independent sampling weight for each minority boundary sample by combining a distance weight and a density weight. To reduce the synthesis of redundant samples, the method enlarges the region around each sample in which new samples are synthesized by using two auxiliary samples instead of one. The thesis selects 12 unbalanced data sets for empirical analysis. The experiments show that the hybrid sampling method handles unbalanced data sets more effectively than five traditional sampling methods. At the algorithm level, the thesis takes the view that ensemble learning achieves better classification performance than individual learners. Following the Stacking ensemble idea, the first layer uses a KNN model, a random forest model, a support vector machine (SVM) model and an XGBoost model as base models, and the second layer uses a logistic regression model; the resulting algorithm is abbreviated XSKR-L. On the same 12 unbalanced data sets, the experiments show that the XSKR-L algorithm handles unbalanced data more effectively than five individual learners. Finally, the hybrid sampling method proposed at the data level and the XSKR-L classification algorithm proposed at the algorithm level are applied together to a real-world network intrusion detection project, providing a reference for handling unbalanced data classification problems in practice. |
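The three evaluation measures named in the abstract can be illustrated with a short, dependency-free sketch. The function names (`confusion_counts`, `f1`, `g_mean`, `auc_rank`) and the toy labels are assumptions made for illustration only; they are not taken from the thesis. G-mean is computed here under its usual definition, the geometric mean of sensitivity and specificity, and AUC as the fraction of (minority, majority) pairs that the scores rank correctly.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f1(y_true, y_pred, positive=1):
    """F1 of the positive class: harmonic mean of precision and recall."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(sensitivity * specificity)


def auc_rank(y_true, scores, positive=1):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic
    in the number of samples, which is acceptable for a small example.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 2 minority samples (label 1), 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.35, 0.8]

print(f1(y_true, y_pred), g_mean(y_true, y_pred), auc_rank(y_true, scores))
```

On the same toy labels, a classifier that always predicts the majority class reaches 80% accuracy but scores 0 on both F1 and G-mean, which is why measures of this kind are preferred over accuracy for unbalanced data.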
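The two-layer Stacking structure described above can be sketched with scikit-learn's `StackingClassifier`. This is only an illustration under stated assumptions, not the thesis's XSKR-L implementation: `GradientBoostingClassifier` stands in for XGBoost so that the example depends on scikit-learn alone, the synthetic data set from `make_classification` replaces the 12 benchmark data sets, and the helper name `build_stack` and all hyperparameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_stack(seed=0):
    """Two-layer stack: four base models, logistic regression on top."""
    base_models = [
        # KNN and SVM are distance based, so their inputs are standardized.
        ("knn", make_pipeline(StandardScaler(),
                              KNeighborsClassifier(n_neighbors=5))),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
        ("svm", make_pipeline(StandardScaler(), SVC(random_state=seed))),
        # Assumption: stand-in for the XGBoost base model.
        ("gb", GradientBoostingClassifier(n_estimators=50,
                                          random_state=seed)),
    ]
    # The second layer is trained on out-of-fold predictions (cv=5)
    # produced by the first-layer models.
    return StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )


# Synthetic unbalanced data: roughly 90% majority, 10% minority.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = build_stack().fit(X_train, y_train)
y_pred = model.predict(X_test)
minority_f1 = f1_score(y_test, y_pred)  # F1 of the minority class (label 1)
print(f"minority-class F1: {minority_f1:.3f}")
```

Training the second layer on out-of-fold predictions rather than on in-sample predictions keeps the logistic regression from simply trusting whichever base model overfits the training set most.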