Research For Imbalanced Big Data Classification Algorithm On Random Forest

Posted on: 2019-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: C Gao
Full Text: PDF
GTID: 2348330545992096
Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, data on the Internet is growing rapidly, and big data applications are attracting increasing attention. However, data generated by real applications are often high-dimensional and imbalanced, which poses a challenge for big data classification. This thesis therefore combines class discrimination degree with the K-means algorithm to reduce high-dimensional feature sets to effective lower-dimensional subsets, improving classification accuracy and efficiency. It then improves the cost-sensitive random forest algorithm to suit imbalanced data. Finally, it parallelizes the cost-sensitive random forest algorithm with MapReduce to classify imbalanced big data.

First, high-dimensional imbalanced data sets contain redundant features, and features strongly correlated with the minority class are easily overlooked. To address this, the thesis proposes a feature selection method based on class discrimination degree. The method clusters all features with K-means and calculates the class discrimination degree of each feature within its cluster. The clusters are ranked by importance using class discrimination degree, and the attributes with the highest class discrimination degree in each cluster are selected to form the reduced attribute set. This preserves, to a certain degree, the number of features strongly relevant to the minority class while handling both feature redundancy and class imbalance. Two high-dimensional imbalanced text data sets are used to compare the method with information gain, chi-square statistics, and other algorithms. The results show that the method handles high-dimensional data effectively.

Second, classifiers for imbalanced data tend to be biased toward the majority class and to neglect the minority class. To address this, the thesis proposes a cost-sensitive random forest classification method. The method constructs cost functions according to the actual distribution of the imbalanced data set and introduces a weight distance into the cost function. Weighted voting based on the performance of each base classifier is then used to improve classification accuracy. Six UCI data sets are used to compare the decision tree, random forest, cost-sensitive random forest, and the proposed algorithm. The results show that the method improves classification performance on the minority class while maintaining overall classification performance.

Finally, on imbalanced big data, modeling and voting take a long time, which seriously degrades classifier performance. The thesis uses MapReduce to parallelize the cost-sensitive random forest algorithm, applying a threefold parallel design to base classifier modeling, attribute splitting, and voting. This accelerates the construction of the base classifiers and improves the performance of the cost-sensitive random forest on imbalanced big data. Four imbalanced big data sets are used to verify the algorithm. The experimental results show that the MapReduce-based parallel random forest greatly improves classification speed and handles imbalanced big data effectively.
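
As a rough illustration of the feature selection step described in the abstract (cluster the features with K-means, then keep the most class-discriminative features from each cluster), a minimal Python sketch might look like the following. The abstract does not define class discrimination degree, so the `discrimination` function here is a stand-in assumption (gap between class means divided by the summed within-class spread). The function names, the farthest-point initialization, and the `per_cluster` parameter are likewise hypothetical, and the ranking of clusters by importance is omitted.

```python
import math
from statistics import mean, pstdev


def discrimination(column, labels):
    """Stand-in score (an assumption, not the thesis's definition):
    gap between class means divided by the summed within-class spread.
    Assumes binary labels (1 = minority, 0 = majority), both present."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    return abs(mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg) + 1e-9)


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; returns one cluster index per point.
    Centers start from a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [mean(dim) for dim in zip(*members)]
    return assign


def select_features(rows, labels, k, per_cluster=1):
    """Cluster the feature columns, then keep the top-scoring
    feature indices from each cluster."""
    columns = [list(col) for col in zip(*rows)]  # one vector per feature
    assign = kmeans(columns, k)
    scores = [discrimination(col, labels) for col in columns]
    chosen = []
    for c in range(k):
        members = [j for j, a in enumerate(assign) if a == c]
        members.sort(key=lambda j: scores[j], reverse=True)
        chosen.extend(members[:per_cluster])
    return sorted(chosen)
```

Clustering the features before scoring them means that near-duplicate features compete only with each other, so redundancy is reduced without discarding an entire group of features that is informative about the minority class.
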
Keywords/Search Tags: class discrimination degree, random forest, cost-sensitive, imbalanced big data