
Improved Algorithm Of Class Imbalance Learning And Its Distributed Research

Posted on: 2021-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: X W Liu
GTID: 2428330602476839
Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced data refers to a data set in which the number of samples differs greatly between classes. The purpose of a class imbalance learning algorithm is to improve the classification performance of traditional classifiers on imbalanced data. Moreover, imbalanced big data is difficult to handle effectively on a single computer. This thesis studies and improves class imbalance learning algorithms and addresses the imbalanced big data classification problem:

(1) To address the performance degradation of traditional classifiers on imbalanced data sets, the FCMUSIC algorithm (Fuzzy C-Means clustering based Under Sampling In Clusters) is proposed. First, a hierarchical clustering algorithm is used to determine a suitable number of clusters for the majority class samples, and then the fuzzy c-means clustering algorithm is used to divide the majority class samples into that number of clusters. In each cluster, the reciprocal of the imbalance ratio (IR) is used as the sampling rate. Imbalance within the majority class is also considered: the samples belonging to small disjuncts in the majority class are found and added to the new majority class sample set, which ensures the diversity and representativeness of the sample. The resulting balanced set is classified with KNN and Random Forest classifiers. The experimental results show that the FCMUSIC algorithm has better classification performance than the comparison algorithms, which verifies its effectiveness. Combining the FCMUSIC algorithm with different classifiers improves the classification performance of each classifier, indicating that the algorithm is independent of the choice of classifier.

(2) The KNN-CBUS algorithm is proposed to improve the performance of the CBUS algorithm. Information about the k nearest neighbors of each majority class sample is used to delete part of the majority class samples, which widens the classification boundary and makes the classification hyperplane clearer. In addition, some noise samples in the minority class are deleted to reduce their interference with the classifier. The CBUS algorithm is then applied to the processed samples. The experimental results show that the KNN-CBUS algorithm further improves the F1 value, G-mean and AUC value compared with the CBUS algorithm, and that 1NN-CBUS yields a greater improvement than 2NN-CBUS. The KNN-CBUS algorithm therefore has an advantage over the CBUS algorithm in processing imbalanced data.

(3) When a classification algorithm is executed on a single computer, it is difficult to deal effectively with imbalanced big data. Based on the Hadoop platform, the PFCMUSIC-RF algorithm is implemented to classify imbalanced big data in parallel. The results of running it on a Hadoop distributed cluster show that the algorithm achieves classification performance equivalent to that of serial execution and performs well in terms of speedup and scalability, which indicates that the PFCMUSIC-RF algorithm can deal effectively with imbalanced big data.
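The per-cluster sampling step in (1) can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes each majority sample already has a hard cluster label (for example, the cluster with the highest fuzzy c-means membership), and it approximates the small-disjunct handling by keeping at least one sample from every cluster. The function name `undersample_in_clusters` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def undersample_in_clusters(majority, cluster_ids, n_minority, seed=0):
    """Keep about a 1/IR fraction of the majority samples in each cluster.

    Hypothetical helper: cluster_ids are assumed to be hard labels obtained
    elsewhere (e.g. argmax of fuzzy c-means memberships).
    """
    if len(majority) != len(cluster_ids):
        raise ValueError("majority and cluster_ids must have the same length")
    if not 0 < n_minority <= len(majority):
        raise ValueError("n_minority must be in (0, len(majority)]")

    # Sampling rate = 1 / IR, where IR = |majority| / |minority|.
    rate = n_minority / len(majority)
    rng = random.Random(seed)

    # Group the majority samples by cluster label.
    clusters = defaultdict(list)
    for sample, cid in zip(majority, cluster_ids):
        clusters[cid].append(sample)

    kept = []
    for cid in sorted(clusters):
        members = clusters[cid]
        # Assumption: keep at least one sample per cluster so that very small
        # clusters (small disjuncts) stay represented.
        k = min(len(members), max(1, round(len(members) * rate)))
        kept.extend(rng.sample(members, k))
    return kept


# Toy data: 90 majority samples in clusters of size 60, 27 and 3; 10 minority samples.
toy_majority = [(c, i) for c, size in enumerate((60, 27, 3)) for i in range(size)]
toy_clusters = [c for c, _ in toy_majority]
balanced_majority = undersample_in_clusters(toy_majority, toy_clusters, n_minority=10)
print(len(balanced_majority))  # 7 + 3 + 1 = 11 samples kept
```

With IR = 9, each cluster contributes roughly one ninth of its members, so the reduced majority set (11 samples here) is close in size to the minority class while every cluster remains represented.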
Keywords/Search Tags:imbalanced data, classification, under sampling, clustering, distributed architecture