| Data imbalance refers to a significant difference between the numbers of positive and negative samples in a data set. Traditional classification algorithms perform poorly on the positive samples of imbalanced data sets, which hinders data mining and analysis of the sample types that people are most interested in. Data imbalance is increasingly common and arises in many application areas, such as medical diagnosis, fraud detection, and object detection. Researchers have therefore proposed many methods to address the problem, mainly data resampling, improvements to existing classification algorithms, and data resampling combined with ensemble learning. Existing methods that combine data resampling with ensemble learning are prone to losing important majority class samples, which degrades classifier learning and thus the classification accuracy on imbalanced data. To address this issue, this thesis proposes DPFBoost, an ensemble algorithm for imbalanced data based on density peaks clustering and fitness. In addition, to shorten the clustering time of the density peaks clustering algorithm used by DPFBoost on large data sets and to improve its applicability, this thesis optimizes that algorithm on Spark and proposes a parallel algorithm called Spark DPC. The overall research contents are as follows: (1) An imbalanced data ensemble algorithm, DPFBoost, based on density peaks clustering and fitness is proposed. The method clusters the majority class of the imbalanced data set and calculates the amount of undersampling needed for each resulting cluster; the amount of undersampling for each cluster is related to the density of that cluster. Then, the local density of each sample within its cluster is used as the fitness of that sample, and each cluster is undersampled separately. Finally, in combination with ensemble learning, the classification performance on imbalanced data sets is improved through repeated sampling and iterative training. The experimental results show that DPFBoost achieves higher classification accuracy on imbalanced data than the comparison algorithms, which indicates that the proposed algorithm performs better in classifying imbalanced data sets. (2) Spark DPC parallelizes two steps of the density peaks clustering algorithm: calculating the Euclidean distances between samples and calculating the local density of each sample point. Analysis of the density peaks clustering algorithm shows that both steps have time complexity O(n^2), so they incur substantial time overhead when the amount of data is large. This thesis therefore implements both steps in parallel on Spark to improve the efficiency of the algorithm. The experimental results show that when Spark DPC runs on a multi-node Spark cluster, its speedup increases with the amount of data, and its clustering efficiency rises accordingly, which indicates that Spark DPC is better suited to clustering large data sets. |
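
The two O(n^2) steps named above (pairwise Euclidean distances and local density) can be illustrated with a minimal plain-Python sketch. It is not the thesis's Spark implementation. It assumes the cutoff-kernel definition of local density commonly used in density peaks clustering, where the density of a sample is the number of other samples closer than a cutoff distance `d_c`. The function names, the round-robin partitioning, and the toy data are hypothetical and chosen only for illustration. The point of the sketch is that each sample's density depends on the full data set but not on any other sample's result, which is what makes the computation easy to split across workers.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_density_rows(points, rows, d_c):
    """Cutoff-kernel local density for the samples whose indices are in `rows`.

    rho_i = number of other samples j with dist(i, j) < d_c (assumed definition).
    Each row reads the whole data set but needs no other row's result.
    """
    densities = {}
    for i in rows:
        densities[i] = sum(
            1
            for j, q in enumerate(points)
            if j != i and euclidean(points[i], q) < d_c
        )
    return densities


def local_density_partitioned(points, d_c, num_partitions):
    """Split row indices into partitions, process each one, and merge the results.

    The partitions are processed sequentially here; a distributed engine
    could hand each partition to a different worker instead.
    """
    n = len(points)
    # Hypothetical round-robin split of the row indices.
    partitions = [range(k, n, num_partitions) for k in range(num_partitions)]
    merged = {}
    for rows in partitions:
        merged.update(local_density_rows(points, rows, d_c))
    return [merged[i] for i in range(n)]


# Toy data: a dense group of three points and a sparser pair.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
rho = local_density_partitioned(points, d_c=1.5, num_partitions=2)
print(rho)
```

The result does not depend on the number of partitions, only the way the work is divided does. In DPFBoost, as described above, the same per-sample local density also serves as the fitness of a sample during undersampling; how that fitness drives the selection is not specified in this abstract, so it is left out of the sketch.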