Research And Application Of High Dimensional Imbalanced Data Classification Based On Random Forest

Posted on: 2018-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Yang
GTID: 2348330515469910
Subject: Software engineering
Abstract/Summary:
High-dimensional data are common in real-world applications such as spam identification, fault diagnosis, face recognition, and medical diagnosis. Improving the classification of high-dimensional imbalanced data is one of the most important research topics in machine learning. The random forest algorithm is an ensemble learning algorithm proposed by Breiman that combines multiple decision trees. Since it was proposed, it has been widely used in many fields because of its good performance. However, when the random forest algorithm is applied to high-dimensional imbalanced data, both its classification performance and the scale of its decision trees suffer. In this dissertation, we study and improve the random forest algorithm at the data level and the algorithm level:

(1) The DESMOTE algorithm is proposed to deal with the class imbalance problem of high-dimensional data. It is a data-level balancing method that improves on the traditional SMOTE algorithm. Building on it, the DESMOTE-RF algorithm uses each decision tree's AUC value as its weight in the final vote, so that classification and prediction use weighted voting instead of the original majority voting, in order to improve the performance of random forest on imbalanced data.

(2) On the basis of the DESMOTE-RF algorithm, the D-LPP-RF and D-SR-RF algorithms are proposed for high-dimensional imbalanced data classification. In both algorithms, before each node of a decision tree is split, the node's data are mapped into another attribute space by an LPP or SR mapping. The optimal splitting feature and the best split point can be found quickly in that space, and the resulting decision tree classifier approximates one built in the original attribute space. The two algorithms greatly reduce the time random forest needs to construct its decision trees, reduce the scale of the constructed trees, and increase the diversity among the trees, and they significantly improve the AUC, G-means, and F-measure values.

(3) Finally, the D-LPP-RF and D-SR-RF algorithms proposed in this dissertation are applied to cancer diagnosis. The rise of gene expression data provides a new method for diagnosing cancer, and such data are characterized by high dimensionality, class imbalance, and small sample size. The proposed algorithms are applied to gene expression data classification and compared with the original random forest algorithm and with three classification algorithms that perform well on gene expression data, in order to verify the classification performance of D-LPP-RF and D-SR-RF on gene expression data.
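
The AUC-weighted vote mentioned in item (1) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`auc_score`, `weighted_vote`, `majority_vote`) are hypothetical, the labels are assumed to be binary (1 for the minority class, 0 for the majority class), and each tree's AUC is assumed to have been measured on some held-out data, which the abstract does not specify.

```python
from typing import Sequence


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def majority_vote(tree_votes: Sequence[int]) -> int:
    """Plain random forest vote: every tree counts equally."""
    return 1 if 2 * sum(tree_votes) > len(tree_votes) else 0


def weighted_vote(tree_votes: Sequence[int], tree_weights: Sequence[float]) -> int:
    """Weighted vote: each tree counts in proportion to its weight (here, its AUC).

    Returns 1 when the trees voting 1 hold more than half of the total weight.
    """
    if len(tree_votes) != len(tree_weights):
        raise ValueError("one weight is required per tree")
    total = sum(tree_weights)
    positive = sum(w for v, w in zip(tree_votes, tree_weights) if v == 1)
    return 1 if positive > total / 2 else 0


if __name__ == "__main__":
    # Assumed per-tree AUC values, for illustration only.
    weights = [0.95, 0.55, 0.35]
    votes = [1, 0, 0]  # only the strongest tree predicts the minority class
    print(majority_vote(votes))           # majority of trees vote 0
    print(weighted_vote(votes, weights))  # the high-AUC tree outweighs the other two
```

In this toy case the majority vote returns 0 while the weighted vote returns 1, because the single tree with the highest assumed AUC holds more than half of the total weight. That is the intended effect of weighting: trees that rank minority-class samples well get more say in the final prediction.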
Keywords/Search Tags:Random forest algorithm, high-dimensional unbalanced data, decision tree, cancer diagnosis