
Two-class Imbalanced Big Data Classification Based On Data Reduction And Ensemble Learning

Posted on: 2021-03-21; Degree: Master; Type: Thesis
Country: China; Candidate: M H Wang; Full Text: PDF
GTID: 2428330620970565; Subject: Software engineering
Abstract/Summary:
The era of big data has arrived, and the scale of the data makes it infeasible for traditional machine learning algorithms to complete training in a stand-alone computing environment. Classification is a fundamental learning task in machine learning and data mining. Traditional classification algorithms are designed on the premise that the data is class-balanced, but in many practical applications the data to be processed is imbalanced. Studying imbalanced data classification, especially in a big data environment, therefore has important theoretical and practical value. According to the number of classes in the data, class-imbalance problems can be divided into two categories: two-class imbalanced data classification and multi-class imbalanced data classification. This thesis studies two-class imbalanced data classification in a large-scale data environment and proposes two solutions, one based on the MapReduce parallel computing framework and one based on Spark. Specifically, the work consists of the following four parts:
(1) The parallelization of the X-means algorithm in the big data environment is studied, and large-scale adaptive X-means clustering algorithms based on MapReduce and Spark are proposed. In two-class imbalanced big data classification, the negative-class data is treated as a data set without class labels and is adaptively clustered by the large-scale X-means algorithms.
(2) Large-scale condensed fuzzy k-nearest neighbor algorithms based on MapReduce and Spark are proposed. The clustering results obtained by the large-scale X-means algorithms are treated as a data set with class labels, and the large-scale condensed fuzzy k-nearest neighbor algorithms are used to undersample the negative class, reducing the number of negative-class samples.
(3) After undersampling, each negative-class cluster is merged with the positive-class samples to form multiple training sets. If a training set is still imbalanced, the positive samples are oversampled to form a balanced training set, and a classifier is trained on each balanced training set. The fuzzy integral method is used to combine the classifiers trained on the different training sets and obtain the final classification result.
(4) On 7 imbalanced large data sets, the two-class imbalanced large-scale classification algorithms based on MapReduce and Spark are compared experimentally with each other on several measures and with other related algorithms. The experimental results show that the proposed algorithms are effective.
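The training-set construction in part (3) can be illustrated with a minimal single-machine sketch. It is not the thesis's MapReduce or Spark implementation. The function name `build_balanced_training_sets`, the label encoding (0 for negative, 1 for positive), and the use of random duplication as the oversampling method are all assumptions made for illustration; the abstract does not say which oversampling technique is used. The negative clusters are assumed to have been clustered and undersampled already.

```python
import random


def build_balanced_training_sets(negative_clusters, positives, seed=0):
    """Merge each reduced negative cluster with the positive samples.

    negative_clusters: list of lists of negative samples (already clustered
        and undersampled; how that is done is outside this sketch).
    positives: list of positive samples.
    Returns one list of (sample, label) pairs per cluster, with label 0 for
    negative and 1 for positive (an assumed encoding).
    """
    rng = random.Random(seed)
    training_sets = []
    for cluster in negative_clusters:
        pos = list(positives)
        shortfall = len(cluster) - len(pos)
        if shortfall > 0:
            # Assumed oversampling method: duplicate random positive samples
            # until the two classes have the same size.
            pos.extend(rng.choices(positives, k=shortfall))
        samples = [(x, 0) for x in cluster] + [(x, 1) for x in pos]
        rng.shuffle(samples)
        training_sets.append(samples)
    return training_sets


# Toy data: two negative clusters and a small positive class.
negative_clusters = [
    [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.3)],
    [(0.9, 0.8), (0.8, 0.9)],
]
positives = [(0.5, 0.5), (0.6, 0.4)]

for ts in build_balanced_training_sets(negative_clusters, positives):
    n_neg = sum(1 for _, y in ts if y == 0)
    n_pos = sum(1 for _, y in ts if y == 1)
    print(n_neg, n_pos)  # prints "4 4" then "2 2"
```

Each returned training set would then be used to train one base classifier. Combining the classifiers with the fuzzy integral is a separate step that this sketch does not cover.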
Keywords/Search Tags: Big data, Imbalanced problem, MapReduce, Spark, Adaptive clustering algorithm, Sample selection, Ensemble learning