
Research On An Improved Oversampling Method Of Unbalanced Data Set And Parallel Algorithm

Posted on: 2022-06-26   Degree: Master   Type: Thesis
Country: China   Candidate: Y Gao   Full Text: PDF
GTID: 2518306488966709   Subject: Engineering
Abstract/Summary:
An unbalanced data set is one in which the number of samples differs greatly between categories. The category with more samples is the majority (negative) class, and the category with fewer samples is the minority (positive) class. In unbalanced data sets, the information carried by the minority samples is more valuable for mining. Oversampling balances the positive and negative classes by increasing the number of positive samples, which improves classification accuracy on imbalanced data sets. To this end, this thesis proposes ACC-SMOTE, an oversampling method for unbalanced data based on ant colony clustering. Existing oversampling methods are also relatively inefficient on high-dimensional, massive imbalanced data sets. Because the Spark platform can process massive data sets quickly and efficiently, the ACC-SMOTE algorithm is parallelized on Spark, yielding the Spark-based parallel oversampling algorithm ACCOP. The specific work is as follows.

The design of ACC-SMOTE has two main aspects. First, an improved ant colony clustering algorithm divides all positive samples into several sub-clusters, which accounts for both the imbalance in sample counts between classes and the imbalance among samples within the minority class. SMOTE then oversamples each sub-cluster according to the proportion of samples it holds, reducing both between-class and within-class imbalance. Second, the oversampled minority samples are corrected using the Tomek links data-cleaning technique, which removes noise from the minority samples and so helps ensure the quality of the synthetic samples. The training and test sets used in this thesis are both UCI data sets. Experimental results show that the algorithm synthesizes higher-quality minority samples and thereby clearly improves classifier accuracy on unbalanced data sets.

ACCOP implements both the ant colony clustering stage and the oversampling stage as parallel algorithms on a Spark cluster. In the ant colony clustering stage, the steps in which each ant determines the clusters and center nodes and then computes the objective function value are parallelized. In the oversampling stage, the computation of Euclidean distances between positive samples is parallelized across all nodes. Experimental results show that, with a fixed number of compute nodes and an increasing number of samples, the algorithm's execution time on the Spark cluster is significantly shorter than in a single-machine environment, which indicates that the parallel algorithm can handle large data sets and improves execution efficiency.
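The oversampling and cleaning steps described in the abstract can be illustrated with a minimal, single-machine sketch. It is not the thesis's implementation: the ant colony clustering stage is skipped (the sub-clusters are assumed to be given), the synthetic-sample budget is assumed to be split in direct proportion to sub-cluster size, and all function names (`smote_in_cluster`, `oversample_by_cluster`, `tomek_links`) are hypothetical. SMOTE is shown in its standard form, interpolating between a minority sample and one of its k nearest neighbours, and a Tomek link is taken to be a pair of mutually nearest samples with different labels.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote_in_cluster(cluster, n_new, k, rng):
    """Create n_new synthetic points inside one minority sub-cluster."""
    if len(cluster) < 2:
        # A single point has no neighbour to interpolate with; copy it.
        return [list(cluster[0]) for _ in range(n_new)]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(cluster)
        # k nearest neighbours of `base` within the same sub-cluster.
        neighbours = sorted(
            (p for p in cluster if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def oversample_by_cluster(clusters, n_total, k=5, seed=0):
    """Split n_total synthetic samples across sub-clusters by size.

    Assumption: each sub-cluster's share is proportional to its size.
    Shares are rounded, so they may not sum exactly to n_total.
    """
    rng = random.Random(seed)
    n_minority = sum(len(c) for c in clusters)
    synthetic = []
    for cluster in clusters:
        share = round(n_total * len(cluster) / n_minority)
        synthetic.extend(smote_in_cluster(cluster, share, k, rng))
    return synthetic


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    def nearest(i):
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: euclidean(X[i], X[j]),
        )

    nn = [nearest(i) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]
```

In a cleaning pass, one member of each returned pair would be dropped; which member to drop (here, presumably the noisy minority sample) is a design choice the abstract does not spell out. The pairwise distance computations in `smote_in_cluster` and `tomek_links` are the part the abstract describes as being distributed across Spark nodes.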
Keywords/Search Tags: Imbalanced datasets, Oversampling, Ant colony clustering, Tomek links, Spark framework