Research On The Classification Algorithm Of Unbalance Data Based On Spark

Posted on:2019-10-05

Degree:Master

Type:Thesis

Country:China

Candidate:P Wang

GTID:2428330566991428

Subject:Software engineering

Abstract/Summary:

With the development of the Internet, data is growing explosively. People need to extract valuable information from it, and classification is one of the most basic methods for doing so. In unbalanced data classification, the numbers of samples in the different classes differ greatly, and the classifier is less sensitive to minority class samples. As the amount of data increases, the number of minority class samples also increases. In a single-machine environment, traditional classification and clustering algorithms often need to iterate many times until the error reaches an acceptable level, and sometimes they cannot cope with classifying large volumes of unbalanced data. To address the low recognition rate of classifiers for minority class samples in large-scale imbalanced data, this thesis studies the following.

A within-class sampling classification method based on Spark is proposed. First, the defects of random undersampling are analyzed, and clustering is proposed as a way to capture the overall characteristics of the majority class samples. Among the clusters produced, the largest is selected for a second round of clustering. The smallest is discarded if it accounts for only a very small proportion of the majority class samples. Then, samples are drawn proportionally according to the number of clusters, and a balanced data set is formed together with the minority class samples. Finally, the support vector machine algorithm in Spark MLlib is used to classify the balanced data. Experiments show that the within-class sampling classification method based on Spark outperforms random undersampling in recognizing minority class samples.

However, the benefit of within-class sampling based on Spark is not obvious when the imbalance ratio increases. Therefore, a between-class and within-class sampling classification method based on Spark is further proposed. First, the data set is clustered, and clusters in which the ratio of majority class samples to minority class samples is below the threshold of 1 are discarded. Second, the number of majority class samples to draw from each cluster is calculated in equal proportion according to the ratio of majority to minority class samples. Then, the majority class samples in each cluster are clustered into a number of sub-clusters, and majority class samples are extracted in equal proportion according to the number of sub-clusters. Finally, a balanced data set is formed together with the minority class samples, and the decision tree algorithm in Spark MLlib is used to classify it. Experiments show that the between-class and within-class sampling classification method outperforms the within-class sampling classification method based on Spark in recognizing minority class samples.
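
The within-class sampling step can be illustrated with a minimal, single-machine Python sketch. It is not the thesis's Spark implementation. The tiny k-means routine, the function name within_class_undersample, the parameters k, min_share and seed, the 5% discard threshold, and the reading of "proportional sampling" as quotas proportional to cluster size are all assumptions made for illustration, and the SVM classification step in Spark MLlib is omitted.

```python
import random
from collections import defaultdict


def kmeans(points, k, rng, iters=20):
    """Tiny Lloyd's k-means over tuples; returns the non-empty clusters."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(groups[i]) for col in zip(*groups[i]))
            if groups[i] else centers[i]
            for i in range(k)
        ]
    return [g for g in groups.values() if g]


def within_class_undersample(majority, n_target, k=3, min_share=0.05, seed=0):
    """Undersample the majority class cluster by cluster (hypothetical helper)."""
    rng = random.Random(seed)
    clusters = sorted(kmeans(majority, k, rng), key=len)  # smallest first

    # Discard the smallest cluster if its share of the majority class is tiny
    # (min_share is an assumed threshold; the abstract gives no value).
    if len(clusters) > 1 and len(clusters[0]) / len(majority) < min_share:
        clusters = clusters[1:]

    # Run a second round of clustering on the largest cluster.
    largest = clusters.pop()
    if len(largest) >= k:
        clusters.extend(kmeans(largest, k, rng))
    else:
        clusters.append(largest)

    # Draw from each cluster in proportion to its size (assumed interpretation).
    total = sum(len(c) for c in clusters)
    sample = []
    for c in clusters:
        quota = min(len(c), round(n_target * len(c) / total))
        sample.extend(rng.sample(c, quota))
    return sample


# Synthetic 2-D data: 500 majority points in three blobs, 50 minority points.
data_rng = random.Random(42)
majority = [
    (data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
    for cx, cy, n in [(0, 0, 300), (8, 8, 150), (-8, 8, 50)]
    for _ in range(n)
]
minority = [(data_rng.gauss(4, 1.0), data_rng.gauss(-4, 1.0)) for _ in range(50)]

balanced_majority = within_class_undersample(majority, n_target=len(minority))
print(len(majority), len(minority), len(balanced_majority))
```

Because each quota is rounded, the sampled majority set may differ from the minority count by a sample or two. In a Spark setting the clustering and sampling would run on distributed data rather than Python lists; the sketch only shows the order of the steps.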

Keywords/Search Tags:

Unbalanced Data, Spark, Cluster, Equal Proportion, Sampling, Support Vector Machine, Decision Tree

Related items

1	Research On Unbalanced Data Classification Based On Support Vector Mixed Sampling
2	Analysis And Application Of Telecommunications Data Based On Support Vector Machine And Decision Tree
3	Research On Some Problems And Applications In Support Vector Machines
4	Research On Algorithm And Its Application Based On Support Vector Machine
5	Research On Traditional Classification Model Based On Unbalanced Data
6	The Research And Application Of The Assessment System Of Suppliers Based On The SVM And Decision Tree Theory
7	Research On E-commerce Fraud Indentification Model Based On Data Mining
8	Classification Algorithm Of Unbalanced Datasets
9	Studies Of Several Mathematical Models And Algorithms Of Support Vector Machine
10	Research On Classification Method Of Random Support Vector Machine And Its Application