Improved Algorithm Of Class Imbalance Learning And Its Distributed Research

Posted on:2021-01-02

Degree:Master

Type:Thesis

Country:China

Candidate:X W Liu

Full Text:PDF

GTID:2428330602476839

Subject:Computer Science and Technology

Abstract/Summary:

Imbalanced data refers to a data set in which the number of samples differs greatly between classes. The purpose of class imbalance learning algorithms is to improve the classification performance of traditional classifiers on imbalanced data. Moreover, it is difficult to handle imbalanced big data effectively on a single computer. This thesis studies and improves class imbalance learning algorithms and the classification of imbalanced big data, as follows.

(1) To address the performance degradation of traditional classifiers on imbalanced data sets, the FCMUSIC algorithm (Fuzzy C-Means clustering Based Under Sampling In Clusters) is proposed. First, a hierarchical clustering algorithm is used to determine a suitable number of clusters for the majority class samples; then the fuzzy c-means clustering algorithm is used to divide the majority class samples into that number of clusters. In each cluster, the reciprocal of the imbalance ratio (IR) is used as the sampling rate. Imbalance within the majority class is also considered: samples in small disjuncts of the majority class are identified and added to the new majority class sample set, which ensures the diversity and representativeness of the sample. The resulting balanced set is then classified with KNN and Random Forest classifiers. The experimental results show that the FCMUSIC algorithm achieves better classification performance than the comparison algorithms, which verifies its effectiveness. Combining FCMUSIC with different classifiers improves the classification performance of each classifier, indicating that the algorithm is independent of the choice of classifier.

(2) The KNN-CBUS algorithm is proposed to improve the performance of the CBUS algorithm. Information about the k nearest neighbors of each majority class sample is used to delete some majority class samples, which widens the classification boundary and makes the classification hyperplane clearer. In addition, some noise samples in the minority class are deleted to reduce their interference with the classifier. The CBUS algorithm is then applied to the processed samples. The experimental results show that the KNN-CBUS algorithm further improves the F1 value, G-mean, and AUC value compared with the CBUS algorithm, and that 1NN-CBUS yields a greater improvement than 2NN-CBUS. The KNN-CBUS algorithm therefore has advantages over the CBUS algorithm in processing imbalanced data.

(3) When a classification algorithm is executed on a single computer, it is difficult to deal effectively with imbalanced big data. The PFCMUSIC-RF algorithm is therefore implemented on the Hadoop platform to classify imbalanced big data in parallel. Results on a Hadoop distributed cluster show that the algorithm achieves classification performance equivalent to that of serial execution and performs well in terms of speedup and scaleup, demonstrating that PFCMUSIC-RF can deal effectively with imbalanced big data.
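The cluster-wise under-sampling step described in (1) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `fuzzy_c_means` and `undersample_in_clusters` are hypothetical, the number of clusters `c` is passed in directly rather than determined by hierarchical clustering, each sample is assigned to its highest-membership cluster, and the small-disjunct handling is omitted. The only part taken from the abstract is that every cluster is sampled at a rate equal to the reciprocal of the imbalance ratio.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iters=50, seed=0):
    """Return an n x c fuzzy membership matrix for `points` (lists of floats)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    for _ in range(iters):
        # Centers are membership-weighted means (weights raised to the fuzzifier m).
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            sw = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / sw for d in range(dim)]
            )
        # Memberships are proportional to distance ** (-2 / (m - 1)).
        for i in range(n):
            dist = [max(math.dist(points[i], ctr), 1e-12) for ctr in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(inv)
            u[i] = [v / total for v in inv]
    return u


def undersample_in_clusters(majority, minority, c, seed=0):
    """Sample each majority cluster at rate 1 / IR, where IR = |majority| / |minority|."""
    rng = random.Random(seed)
    rate = len(minority) / len(majority)  # reciprocal of the imbalance ratio
    u = fuzzy_c_means(majority, c, seed=seed)
    # Assumption: harden the fuzzy partition by taking the largest membership.
    clusters = [[] for _ in range(c)]
    for point, row in zip(majority, u):
        clusters[row.index(max(row))].append(point)
    kept = []
    for members in clusters:
        if members:
            # Keep at least one sample so every non-empty cluster stays represented.
            k = min(len(members), max(1, round(len(members) * rate)))
            kept.extend(rng.sample(members, k))
    return kept
```

Calling `undersample_in_clusters(majority, minority, c=2)` on 90 majority and 10 minority points returns roughly 10 majority points drawn proportionally from the two clusters, which can then be merged with the minority samples and passed to any classifier.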

Keywords/Search Tags:

imbalanced data, classification, under sampling, clustering, distributed architecture

Related items

1	An Imbalanced Data Classification Algorithm Combining Clustering With Sampling Strategy
2	Research On Under-sampling Algorithm For Imbalanced Data Based On Clustering And Its Application
3	Imbalanced Classification Algorithm Based On Clustering Ensemble Under-Sampling
4	Research On Imbalanced Data Sampling Methods For Text Sentiment Classification
5	Research On Decision Tree Classification Method Of Imbalanced Data Based On Reinforcement Learning
6	The Research Of Imbalanced Data Classification
7	The Algorithm Research Of Associative Classification And Classification Based On Imbalanced Data
8	Application Of Clustering Based Sampling Algorithms In Unbalanced Data Learning
9	Research On Imbalanced Dataset Classification Algorithm Based On Sampling
10	Complaints Text Classification Research Of Imbalanced Data Sets