A New Random Projection-Based Ensemble Classifier For High Dimensional Imbalance Data

Posted on: 2020-08-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Huang
GTID: 2428330575465850
Subject: Statistics
Abstract/Summary:
High-dimensional imbalanced data are frequently encountered in many fields, for example in genetic, signal, and financial data. How to classify such data effectively is an important research direction. We propose an ensemble method of decision trees based on random projection, and use threshold-moving to extend the method to high-dimensional imbalanced data.

In Chapter 2, for high-dimensional classification, we propose an ensemble method of decision trees based on random projection, Projection Forest (PJForest). The method uses a decision tree as the base classifier. First, multiple random projections are generated to reduce the dimensionality of the data. A decision tree is then trained on each low-dimensional data set, and finally the trees are combined by the majority voting rule to make more accurate predictions. The benefit of using random projection is two-fold. It preserves the geometrical relationships in the data during dimension reduction. More importantly, by perturbing the original data it yields many decision trees that are both accurate and different from one another, which enriches the diversity of the ensemble, reduces the effect of noise, and improves the generalization of PJForest. We establish a limiting property of the generalization error of PJForest. The results of the simulation study show that PJForest can effectively classify high-dimensional data containing a large amount of noise, and has better classification performance than existing methods such as random forest and XGBoost. Finally, an analysis of real data is also carried out.

In Chapter 3, we extend PJForest to high-dimensional imbalanced data and propose a PJForest method based on threshold-moving, Balance Projection Forest (BPJForest). Changing the voting threshold moves the decision boundary and improves the classification performance on the minority class, so BPJForest can effectively classify high-dimensional imbalanced data. When balanced accuracy is used as the performance measure, we give an optimal threshold selection method. The limiting property of the PJForest generalization error is then extended to BPJForest, and a similar theoretical result is obtained. The results of the simulation study show that BPJForest can effectively classify high-dimensional imbalanced data, and has better classification performance than existing methods such as PJForest and RPF.
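The pipeline summarized in the abstract (random projections, one base learner per projection, vote aggregation, then a moved voting threshold) can be illustrated with a minimal, standard-library-only sketch. It is not the thesis implementation, and several choices are assumptions made for brevity: the projection matrices are Gaussian (the abstract does not state the distribution), depth-1 decision stumps stand in for full decision trees, and a simple grid search over thresholds stands in for the optimal threshold selection method of Chapter 3. All function names are hypothetical.

```python
import random

def random_projection(d, k, rng):
    """Assumed Gaussian projection matrix (d x k), entries N(0, 1/k)."""
    scale = 1.0 / k ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(k)] for _ in range(d)]

def project(X, R):
    """Map each d-dimensional row of X to k dimensions."""
    k = len(R[0])
    return [[sum(xi * ri[j] for xi, ri in zip(x, R)) for j in range(k)] for x in X]

def fit_stump(Z, y):
    """Depth-1 tree (stand-in for a full decision tree): the split
    (feature, cut point, polarity) with the lowest training error."""
    n = len(y)
    best = (n + 1, 0, float("inf"), 1)  # fallback: always predict class 0
    for j in range(len(Z[0])):
        values = sorted(set(row[j] for row in Z))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            err = sum((row[j] > t) != (label == 1) for row, label in zip(Z, y))
            for polarity, e in ((1, err), (0, n - err)):
                if e < best[0]:
                    best = (e, j, t, polarity)
    return best[1:]

def predict_stump(stump, Z):
    j, t, polarity = stump
    return [int(row[j] > t) if polarity == 1 else int(row[j] <= t) for row in Z]

def fit_ensemble(X, y, n_trees=15, k=3, seed=0):
    """One fresh random projection and one base learner per ensemble member."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_trees):
        R = random_projection(len(X[0]), k, rng)
        models.append((R, fit_stump(project(X, R), y)))
    return models

def vote_fraction(models, X):
    """Fraction of base learners voting for class 1 (the minority class)."""
    votes = [0] * len(X)
    for R, stump in models:
        for i, p in enumerate(predict_stump(stump, project(X, R))):
            votes[i] += p
    return [v / len(models) for v in votes]

def apply_threshold(fractions, threshold=0.5):
    """threshold=0.5 is majority voting; a lower value favours class 1."""
    return [1 if f > threshold else 0 for f in fractions]

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls (both classes must be present)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)

def pick_threshold(fractions, y, grid):
    """Grid search (an assumption, not the thesis's selection rule)."""
    return max(grid, key=lambda t: balanced_accuracy(y, apply_threshold(fractions, t)))

# Synthetic imbalanced data: 12 minority rows, 108 majority rows, 20 features,
# of which only the first two carry signal.
rng = random.Random(1)
y = [1] * 12 + [0] * 108
X = [[rng.gauss(2.0 if (label == 1 and i < 2) else 0.0, 1.0) for i in range(20)]
     for label in y]

models = fit_ensemble(X, y)
fractions = vote_fraction(models, X)
grid = [i / 10 for i in range(1, 10)]
best_t = pick_threshold(fractions, y, grid)
```

For brevity the threshold is chosen on the training data; in practice it would more sensibly be selected on held-out or out-of-bag data. Because the grid contains 0.5, the selected threshold can never score lower in balanced accuracy on this data than plain majority voting.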
Keywords/Search Tags: decision tree, diversity, high dimensional classification, imbalanced, ensemble learning, random projection, threshold-moving