
Research On The Classification Algorithm Of Unbalanced Data Based On Spark

Posted on: 2019-10-05
Degree: Master
Type: Thesis
Country: China
Candidate: P Wang
Full Text: PDF
GTID: 2428330566991428
Subject: Software engineering
Abstract/Summary:
With the development of the Internet, data is growing explosively. People need to extract valuable information from it, and classification is one of the most basic methods for doing so. In unbalanced data classification, the numbers of samples in the different classes differ greatly, and the classifier is less sensitive to minority class samples. As the amount of data increases, the number of minority class samples also increases. In a single-machine environment, traditional classification and clustering algorithms often have to iterate many times until the error is small enough, and they sometimes cannot cope with large volumes of unbalanced data. To address the low recognition rate of classifiers for minority class samples in large-scale imbalanced data, this thesis studies the following.

First, a within-class sampling classification method based on Spark is proposed. The defects of random undersampling are analyzed, and clustering is proposed as a way to capture the overall characteristics of the majority class samples. Among the clusters produced, the one with the most samples is clustered a second time. The one with the fewest samples is discarded if it accounts for only a very small proportion of the majority class samples. Samples are then drawn proportionally according to the number of clusters and combined with the minority class samples to form a balanced data set. Finally, the support vector machine algorithm in Spark MLlib is used to classify the data. Experiments show that the within-class sampling classification method based on Spark recognizes minority class samples better than random undersampling.

However, the benefit of within-class sampling based on Spark becomes less pronounced as the imbalance ratio increases. A between-class and within-class sampling classification method based on Spark is therefore further proposed. It first clusters the data set and discards the clusters in which the ratio of majority class samples to minority class samples is below the threshold value of 1. Second, the number of majority class samples to draw from each cluster is calculated proportionally, according to the ratio of majority to minority class samples. Then the majority class samples in each cluster are clustered again to generate a number of sub-clusters, and majority class samples are extracted in equal proportion according to the number of sub-clusters. Finally, these are combined with the minority class samples to form a balanced data set, and the decision tree algorithm in Spark MLlib is used to classify the data. Experiments show that the between-class and within-class sampling classification method recognizes minority class samples better than the within-class sampling classification method based on Spark.
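
The proportional sampling step of the within-class method can be illustrated with a small single-machine sketch. It is not the thesis's Spark implementation. It assumes the majority class samples have already been assigned cluster labels (for example by k-means), that "proportional" means proportional to cluster size, and that the smallest cluster is dropped when it holds less than a hypothetical `min_fraction` of the majority class samples. The second clustering of the largest cluster is omitted, and the function names are illustrative only.

```python
import random
from collections import defaultdict


def proportional_quotas(cluster_sizes, target_total):
    """Split target_total across clusters in proportion to their sizes.

    Uses integer arithmetic with largest-remainder rounding so the
    quotas add up exactly to target_total (capped at the total size).
    """
    total = sum(cluster_sizes.values())
    target = min(target_total, total)
    quotas, remainders = {}, {}
    for label, size in cluster_sizes.items():
        quotas[label], remainders[label] = divmod(target * size, total)
    leftover = target - sum(quotas.values())
    # Give the remaining slots to the clusters with the largest remainders.
    for label in sorted(remainders, key=remainders.get, reverse=True):
        if leftover == 0:
            break
        quotas[label] += 1
        leftover -= 1
    return quotas


def within_class_undersample(majority, cluster_of, minority_count,
                             min_fraction=0.05, seed=0):
    """Undersample the majority class cluster by cluster.

    majority       -- list of majority class samples
    cluster_of     -- cluster label for each majority sample (assumed given)
    minority_count -- number of minority class samples to balance against
    min_fraction   -- assumed cutoff below which the smallest cluster is dropped
    """
    clusters = defaultdict(list)
    for sample, label in zip(majority, cluster_of):
        clusters[label].append(sample)

    # Discard the smallest cluster only if it is a tiny share of the majority.
    smallest = min(clusters, key=lambda c: len(clusters[c]))
    if len(clusters) > 1 and len(clusters[smallest]) / len(majority) < min_fraction:
        del clusters[smallest]

    sizes = {label: len(items) for label, items in clusters.items()}
    quotas = proportional_quotas(sizes, minority_count)

    rng = random.Random(seed)
    picked = []
    for label in sorted(clusters):
        picked.extend(rng.sample(clusters[label], quotas[label]))
    return picked
```

The returned majority class subset would then be combined with all minority class samples to form the balanced training set. In a Spark setting the same quotas could be computed on the driver and applied per cluster, but that mapping is an assumption and is not shown here.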
Keywords/Search Tags:Unbalanced Data, Spark, Cluster, Equal Proportion, Sampling, Support Vector Machine, Decision Tree