Research For Imbalanced Big Data Classification Algorithm On Random Forest

Posted on:2019-03-30

Degree:Master

Type:Thesis

Country:China

Candidate:C Gao

Full Text:PDF

GTID:2348330545992096

Subject:Computer Science and Technology

Abstract/Summary:

With the development of information technology, the volume of data on the Internet is growing rapidly, and big data applications are attracting increasing attention. However, the data generated by real applications are often high-dimensional and imbalanced, which poses a challenge for big data classification. This paper therefore combines class discrimination degree with the K-means algorithm to reduce high-dimensional feature sets and select effective lower-dimensional subsets, improving classification accuracy and efficiency. The cost-sensitive random forest algorithm is then improved to suit the classification of imbalanced data. Finally, the cost-sensitive random forest algorithm is parallelized with MapReduce to classify imbalanced big data.

First, high-dimensional imbalanced data sets contain redundant features, and features strongly correlated with the minority class are easily ignored. To address this, this paper proposes a new feature selection method based on class discrimination degree for high-dimensional imbalanced data. The method clusters all features with K-means and calculates the class discrimination degree of each feature within its cluster. The clusters are ranked by importance using class discrimination degree, and the attributes with the highest class discrimination degree in each cluster are selected to form the reduced attribute set. This guarantees, to a certain degree, the number of features strongly relevant to the minority class, and handles both feature redundancy and feature imbalance in high-dimensional data. Two groups of high-dimensional imbalanced text data sets are used to compare the algorithm with information gain, chi-square statistics and other algorithms. The results show that the method can effectively handle high-dimensional data.

Second, classifiers for imbalanced data tend to be biased towards the majority class while ignoring the minority class. To address this, this paper proposes a cost-sensitive random forest classification method. The method constructs cost functions according to the actual distribution of the imbalanced data set and introduces a weight distance into the cost function. Then, according to the performance of each base classifier, weighted voting is adopted to improve classification accuracy. Six UCI data sets are used to compare decision tree, random forest, cost-sensitive random forest and the proposed algorithm. The results show that the method can effectively improve classification performance on the minority class while maintaining overall classification performance.

Finally, on imbalanced big data, modeling and voting take a long time, which seriously affects the performance of the classifier. This paper uses MapReduce to parallelize the cost-sensitive random forest algorithm, applying a three-fold parallel design to the base classifier modeling process, the attribute splitting process and the voting process. This accelerates the construction of the base classifiers and improves the classification performance of the cost-sensitive random forest on imbalanced big data. Four groups of imbalanced big data are used to verify the algorithm. The experimental results show that the MapReduce-based parallel design of random forest greatly improves classification speed and effectively handles imbalanced big data.
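
As a rough illustration of the weighted, parallel voting step, the sketch below simulates a map pass and a reduce pass in plain Python rather than on Hadoop. Every name in it (map_votes, reduce_votes, weighted_vote), the tree weights and the sample labels are assumptions made for illustration. The abstract does not say how the vote weights are computed, so the weights here simply stand in for a per-tree performance score.

```python
from collections import defaultdict


def map_votes(tree_weight, predictions):
    """Map step: one base classifier emits (sample_id, (label, weight)) pairs."""
    for sample_id, label in predictions.items():
        yield sample_id, (label, tree_weight)


def reduce_votes(votes):
    """Reduce step: sum the weights per label and return the heaviest label."""
    totals = defaultdict(float)
    for label, weight in votes:
        totals[label] += weight
    # Iterating labels in sorted order makes ties deterministic.
    return max(sorted(totals), key=lambda label: totals[label])


def weighted_vote(trees):
    """Simulate shuffle and reduce in-process for a list of (weight, predictions)."""
    grouped = defaultdict(list)
    for weight, predictions in trees:
        for sample_id, vote in map_votes(weight, predictions):
            grouped[sample_id].append(vote)
    return {sample_id: reduce_votes(votes) for sample_id, votes in grouped.items()}


# Hypothetical input: three trees, each with an assumed weight and its
# predicted label for samples "a" and "b".
trees = [
    (0.9, {"a": "minority", "b": "majority"}),
    (0.4, {"a": "majority", "b": "majority"}),
    (0.4, {"a": "majority", "b": "minority"}),
]
result = weighted_vote(trees)
print(result)
```

In this toy run a plain majority vote would assign sample "a" to the majority class, while the weighted vote follows the single higher-weight tree and returns the minority class.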

Keywords/Search Tags:

class discrimination degree, random forest, cost-sensitive, imbalanced big data

Related items

1	Class-Imbalanced Data Stream Classification Method Based On Adaptive Random Forest
2	Research On Imbalanced Data Issue In SAR Target Discrimination
3	Research Of Ensemble Learning For High-dimensional And Imbalanced Data Classification
4	Research On Rotation Forest Algorithm For Imbalanced Data Classification Problem
5	The Imbalanced Data Classification Algorithm Based On Integrated Learning And Its Application In Product Quality Discrimination
6	Research On Imbalanced Data Classification Method Based On Random Forest Algorithm
7	Research On Multi-View Classification With Cost-Sensitive
8	The Application Of Improved AdaBoost Algorithm Based On Cost Sensitive In Imbalanced Data
9	A Research On The Application Of Telecom Customer Churn Prediction Based On Random Forest
10	Research On The Method Of Solving Imbalanced Classification Problems Based On Random Forest Algorithm