Research And Application Of High Dimensional Imbalanced Data Classification Based On Random Forest

Posted on:2018-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Yang

GTID:2348330515469910

Subject:Software engineering

Abstract/Summary:

High-dimensional data are common in real-world applications such as spam identification, fault diagnosis, face recognition, and medical diagnosis. Improving the classification of high-dimensional imbalanced data is one of the most important research topics in machine learning. The random forest algorithm is an ensemble learning algorithm proposed by Breiman that combines multiple decision trees, and since its introduction it has been widely used in many fields because of its good performance. However, when the random forest algorithm is applied to high-dimensional imbalanced data, both its classification performance and the size of its decision trees suffer. In this dissertation, we study and improve the random forest algorithm at the data level and at the algorithm level:

(1) The DESMOTE algorithm is proposed to deal with the class imbalance problem of high-dimensional data. It is a data-level balancing method that improves on the traditional SMOTE algorithm. Building on it, the DESMOTE-RF algorithm uses the AUC value as the weight of each vote when the random forest makes its final decision, so that weighted voting replaces the original majority voting in classification and prediction. This improves the performance of the random forest algorithm on imbalanced data.

(2) On the basis of the DESMOTE-RF algorithm, the D-LPP-RF and D-SR-RF algorithms are proposed for high-dimensional imbalanced data classification. Before each node of a decision tree is split, the node's data are mapped into another attribute space by an LPP or SR mapping. The optimal splitting feature and the best split point can be found quickly in that space, and the resulting decision tree classifier approximates one built in the original attribute space. The two algorithms greatly reduce the time needed to construct the decision trees of the random forest, reduce the size of the constructed trees, and increase the diversity among the trees. They also significantly improve the AUC, G-means, and F-measure values of the random forest algorithm.

(3) Finally, the D-LPP-RF and D-SR-RF algorithms proposed in this dissertation are applied to cancer diagnosis. The rise of gene expression data provides a new method for diagnosing cancer, and such data are characterized by high dimensionality, class imbalance, and small sample size. The proposed algorithms are applied to the classification of gene expression data and compared with the original random forest algorithm and with three classification algorithms that perform well on gene expression data, in order to verify the classification performance of D-LPP-RF and D-SR-RF on such data.
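
The AUC-weighted voting idea and the G-means measure mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the DESMOTE-RF implementation from the dissertation: the function names, the toy scores, the use of a validation set to compute each tree's AUC, and the averaging of per-tree scores are all assumptions made for illustration.

```python
from math import sqrt


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_vote(tree_scores, weights, threshold=0.5):
    """Weighted average of per-tree minority-class scores, then a threshold."""
    total = sum(weights)
    n_samples = len(tree_scores[0])
    combined = [
        sum(w * s[i] for w, s in zip(weights, tree_scores)) / total
        for i in range(n_samples)
    ]
    return [1 if c >= threshold else 0 for c in combined]


def g_mean(labels, preds):
    """Geometric mean of sensitivity and specificity."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return sqrt((tp / n_pos) * (tn / n_neg))


# Hypothetical toy data: 1 = minority class, 0 = majority class.
val_labels = [1, 0, 0, 1, 0, 0]
# Hypothetical minority-class scores from three trees on the same samples.
val_scores = [
    [0.9, 0.2, 0.1, 0.8, 0.3, 0.2],  # ranks every positive above every negative
    [0.6, 0.7, 0.2, 0.4, 0.5, 0.1],  # weaker tree
    [0.2, 0.8, 0.6, 0.3, 0.7, 0.9],  # ranks every positive below every negative
]

# Assumption: each tree's vote weight is its AUC on these validation samples.
weights = [auc(val_labels, s) for s in val_scores]
weighted_preds = weighted_vote(val_scores, weights)

# Plain majority voting for comparison: hard votes with equal weights.
hard_votes = [[1 if s >= 0.5 else 0 for s in tree] for tree in val_scores]
majority_preds = weighted_vote(hard_votes, [1, 1, 1])

print("weights:", weights)
print("G-mean, AUC-weighted vote:", g_mean(val_labels, weighted_preds))
print("G-mean, majority vote:", g_mean(val_labels, majority_preds))
```

In this toy setting the tree that ranks the samples worst receives a weight of zero, so it no longer outvotes the stronger tree on the minority samples. Real data will not separate this cleanly, and the weights would normally come from data not used to fit the trees.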

Keywords/Search Tags:

Random forest algorithm, high-dimensional unbalanced data, decision tree, cancer diagnosis

Related items

1	The Research On Random Forest And Its Parallelization Oriented To Unbalanced High-dimensional Data
2	Evaluation Of Confounder-controlled Random Forest And Its Application In High Dimensional Data Analysis
3	Research On Multi-specification Cargo Loading Based On Improved Random Forest Algorithm
4	Random Forest Algorithm Research And Application Based On BLB Method
5	Analysis Of Individual Credit Evaluation Indicators Based On Random Forests
6	A New Random Projection-Based Ensemble Classifier For High Dimensional Imbalance Data
7	Research On Code Plagiarism Detection Model Based On Random Forest And Gradient Boosting Decision Tree
8	Random Forest Based On Attributes Combination
9	Research On XGBoost Decision Tree Optimization And Its Application
10	Research On Road Detection Technology Based On Semi-random Forest Algorithm