
Research On The Processing Method Of Unbalanced Data Set Based On Improved SMOTE Algorithm

Posted on: 2021-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Yu
GTID: 2428330620472145
Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of information technology, data of all kinds provide people with useful information. The useful information contained in unbalanced data sets is increasingly being mined and used. The class distribution inside such data sets is severely unbalanced, and the useful information usually accounts for only a small share of the data, yet the minority-class events are of great practical value to the field they belong to. Because minority samples are scarce, researchers must invest more effort in analyzing and mining this information.

Among the many sampling algorithms for unbalanced data sets, the SMOTE algorithm effectively addresses the randomness of earlier sampling methods and the overfitting caused by randomly adding sample points. It provides a powerful guideline for dealing with unbalanced data sets, but it also has limitations. SMOTE randomly selects a minority sample point and finds its k nearest neighbors, but it does not specify how to choose k, so the best value can only be found by trial on the data set at hand. This makes the choice of k blind and wastes researchers' time. In addition, when a new sample point is generated from a sample point at the class boundary and one of its nearest neighbors, the new points become increasingly marginalized, which gradually blurs the boundary between the positive and negative classes and also affects the original data distribution.

The specific work of this thesis is as follows. First, the problems of the SMOTE algorithm are analyzed theoretically. Building on the previously proposed Borderline-SMOTE algorithm, a new KB-SMOTE algorithm is proposed that combines it with the K-means clustering algorithm. The new algorithm clusters the minority class before sampling the data set. After clustering, each cluster is evaluated: the condition of each sample point is determined according to the Borderline-SMOTE algorithm, and each cluster is assigned to one of three sets: the noise cluster set, the boundary cluster set, or the safe cluster set. New sample points are generated only for the clusters in the boundary cluster set; the majority of samples in the boundary clusters are removed, and new sample points are generated according to a new interpolation formula. This method effectively addresses the problem that SMOTE blurs the classification boundary of the data set when determining its k nearest neighbors. Generating new sample points within a cluster also reduces the impact on the original data distribution.

Second, to verify the effectiveness of the new algorithm, a credit card fraud detection data set is selected and prepared in four ways: the original data set, random downsampling, the SMOTE algorithm, and the KB-SMOTE algorithm. Each resulting data set is used to train a logistic regression model, which is tuned using 5-fold cross-validation and a regularization penalty term. Finally, the test set is fed in turn to each of the logistic regression models trained in this way, and the effectiveness of the KB-SMOTE algorithm is confirmed by comparing and analyzing the models' classification performance.
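The clustering and boundary-only oversampling described above can be illustrated with a minimal sketch. It is not the thesis's implementation, and it rests on several assumptions: the abstract does not give the new interpolation formula, so the standard SMOTE interpolation is used in its place; a cluster is labeled by a majority vote over the Borderline-SMOTE labels of its points; each new point is interpolated between two points of the same cluster; and the sample-removal step is left out because the abstract does not specify it. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def smote_point(x, neighbor, rng):
    """Standard SMOTE interpolation: x + gap * (neighbor - x), gap in [0, 1)."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def label_point(x, minority, majority, m):
    """Borderline-SMOTE rule: count majority points among the m nearest neighbors."""
    dists = [(math.dist(x, p), 0) for p in minority if p is not x]
    dists += [(math.dist(x, p), 1) for p in majority]
    dists.sort(key=lambda t: t[0])
    n_major = sum(flag for _, flag in dists[:m])
    if n_major == m:
        return "noise"
    return "boundary" if n_major >= m / 2 else "safe"


def kmeans(points, k, rng, iters=20):
    """Tiny Lloyd's k-means; returns the non-empty clusters as lists of points."""
    centers = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return [g for g in groups if g]


def kb_smote_sketch(minority, majority, n_clusters=2, m=3, n_new=4, seed=0):
    """Cluster the minority class, then oversample only the boundary clusters."""
    rng = random.Random(seed)
    synthetic = []
    for cluster in kmeans(minority, n_clusters, rng):
        # Assumption: a cluster takes the most common label among its points.
        votes = Counter(label_point(p, minority, majority, m) for p in cluster)
        if votes.most_common(1)[0][0] != "boundary" or len(cluster) < 2:
            continue
        for _ in range(n_new):
            # Assumption: both endpoints come from the same cluster.
            x, neighbor = rng.sample(cluster, 2)
            synthetic.append(smote_point(x, neighbor, rng))
    return synthetic


# Toy data: one minority group far from the majority, one right next to it.
minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0]]
majority = [[5.0, 5.1], [5.1, 5.1], [4.9, 5.0], [9.0, 9.0], [9.0, 8.0]]
new_points = kb_smote_sketch(minority, majority)
```

In this toy data the minority group near the origin has only minority neighbors, so it is treated as safe and left alone, while the two minority points near (5, 5) sit among majority points and form the only boundary cluster. Because interpolation stays inside that cluster, the synthetic points lie on the segment between its two members rather than drifting toward the other group.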
Keywords/Search Tags: Unbalanced Data Set, K-means Algorithm, Bat Algorithm, Credit Card Fraud Detection