
Research On Imbalanced Data Classification Algorithm Based On Random Forest And Its Parallelization

Posted on: 2019-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: S C Wang
GTID: 2438330566983716
Subject: Computer software and theory
Abstract/Summary:
Imbalanced data is ubiquitous in practical applications. To maximize overall classification performance, traditional classification algorithms usually assume a balanced data distribution or effectively ignore minority class samples, so their classification accuracy on minority class samples is poor. It is therefore of both theoretical and practical significance to design a classification algorithm that handles imbalanced data effectively, improving both the classification accuracy of minority class samples and the overall performance of the classifier.

Ensemble classifiers are well suited to imbalanced data classification because they can balance errors within a certain range. Random forest is one such ensemble algorithm, but its classification performance degrades when the data is severely imbalanced. Moreover, when the data set contains many noisy and redundant features, a random forest model built on those features predicts poorly. It is therefore necessary to design a sound training scheme for classifiers on imbalanced data. In addition, as data size grows, preprocessing imbalanced data and constructing classifiers incur increasing computing costs, so classification efficiency must also be considered. Because random forest builds multiple independent, different decision trees and decides by voting, it lends itself well to parallel processing.

Building on the research background, significance, basic concepts, and related technologies of the topic, the thesis first reviews a large body of literature. Second, to address the bottlenecks of imbalanced data classification, namely imbalanced samples, low classification accuracy on minority class samples, and classification efficiency, the thesis draws on Spark's efficient data processing capabilities and proposes a random forest based imbalanced data classification algorithm for the Spark environment. The method first samples the majority samples using comprehensive weights, which are derived from the weight of each class among the majority samples and the sample sizes of the minority classes; the sampled majority samples and the minority class samples together form training data sets of balanced size. Next, a relevance-based feature selection method selects the optimal feature subset, and weighted voting is used to improve the random forest algorithm, which is then used to obtain the sub-classifiers. Finally, the method is verified experimentally on UCI data sets in the Spark environment. The experimental results show that the proposed method improves both the overall classification accuracy and the classification efficiency.
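
The abstract does not give the comprehensive weight formula or the voting weights, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes each majority class receives a sampling quota proportional to its share of all majority samples, with the quotas summing to roughly the minority class size, and that each tree's vote is scaled by a weight such as its validation accuracy. The function names `balanced_undersample` and `weighted_vote` are hypothetical, and the code is plain Python rather than Spark.

```python
import random
from collections import defaultdict


def balanced_undersample(samples, minority_label, seed=0):
    """Return a roughly balanced training set.

    samples: list of (features, label) pairs.
    Assumption (not the thesis's exact formula): each majority class gets a
    quota proportional to its share of all majority samples, and the quotas
    sum to about the number of minority samples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append((features, label))

    # Keep every minority sample; everything left in by_class is majority.
    minority = by_class.pop(minority_label)
    n_majority = sum(len(group) for group in by_class.values())

    balanced = list(minority)
    for group in by_class.values():
        share = len(group) / n_majority
        quota = min(len(group), max(1, round(share * len(minority))))
        balanced.extend(rng.sample(group, quota))  # sample without replacement
    rng.shuffle(balanced)
    return balanced


def weighted_vote(predictions, weights):
    """Combine per-tree predictions; each tree's vote counts by its weight."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)
```

With 60 samples of class "A", 30 of class "B", and 10 of a minority class, this sketch keeps all 10 minority samples and draws 7 and 3 samples from "A" and "B" respectively. Because each tree can be trained on its own balanced subset independently of the others, a per-tree step of this kind is the part that could be distributed across workers.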
Keywords/Search Tags:classification, imbalanced data, comprehensive weight, random forest, Spark, parallelization