
Variable Importance Measure And Kernel Density Estimation Based On Random Forest

Posted on: 2018-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: P Peng
Full Text: PDF
GTID: 2428330515952523
Subject: Control Engineering

Abstract/Summary:
Random forest, an important and widely used data-mining algorithm, offers high classification performance, few parameters, fast and efficient operation, and tolerance to noise. In addition, random forest can estimate variable importance, the out-of-bag (OOB) error, and the proximity between samples. These strengths have made it widely used and studied in many fields. However, the OOB error of the decision trees does not faithfully reflect the generalization performance of the random forest, and it may not change at all when a variable is randomly permuted, especially on high-dimensional data containing many correlated variables. Consequently, the traditional variable importance measure (VIM) of random forest, based on the error of the decision trees, may be unreliable and unstable as a feature selection method. To address these problems, this thesis presents a new variable importance measure based on margin series (VIM-MS), which measures the importance of a variable by the similarity between two margin series, computed before and after randomly permuting that variable. We assess VIM and VIM-MS by combining the stability of the feature selection algorithm with predictive accuracy. Experiments on gene data sets and UCI data sets show that VIM-MS is more stable than VIM, and that this gain does not sacrifice classification performance.

Existing methods for building ensembles of probability estimation trees mainly include leaf frequency, Laplace estimation, and m-estimation. However, the posterior probabilities produced by these methods have been shown to be highly biased and unreliable, and these methods ignore the differences among test examples that fall into the same leaf node. To address these problems, this thesis proposes improved and novel probability estimators that combine random forest with kernel density estimation (RFPE-KED). The class-conditional density function is estimated by kernel density estimation, and the posterior probabilities are then computed by Bayes' formula, which provides risk probabilities for the classification of random forest. To make kernel density estimation applicable to high-dimensional data, we use random forest for dimension reduction: local kernel density estimators are built in the reduced subspaces corresponding to the trees of the random forest (RFPE-KEDI), its nodes (RFPE-KEDII), and its proximity measure (RFPE-KEDIII); for comparison, we also present two kernel density estimators in random subspaces (RFPE-KEDIV) and in the original space (RFPE-KEDV). We use the mean squared error (MSE) to compare ensembles of probability estimation trees with RFPE-KED. Preliminary experiments on synthetic data show that RFPE-KED provides more accurate probability estimates.
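The permutation-and-compare idea behind VIM-MS can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the margin is taken as P(true class) − max P(other class) per sample, and the similarity between the two margin series is assumed to be Pearson correlation (the abstract does not name the metric), so importance is scored as 1 − correlation. All function names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def margin_series(forest, X, y):
    """Per-sample margin: P(true class) - max P(any other class)."""
    proba = forest.predict_proba(X)
    idx = np.arange(len(y))
    true_p = proba[idx, y]
    others = proba.copy()
    others[idx, y] = -np.inf          # mask the true class
    return true_p - others.max(axis=1)

def vim_ms(forest, X, y, seed=0):
    """Importance of each variable = dissimilarity (assumed: 1 - Pearson
    correlation) of the margin series before vs. after permuting it."""
    rng = np.random.default_rng(seed)
    base = margin_series(forest, X, y)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # destroy variable j's signal
        perm = margin_series(forest, Xp, y)
        scores.append(1.0 - np.corrcoef(base, perm)[0, 1])
    return np.array(scores)

# Toy data: with shuffle=False the first 3 columns are informative,
# the last 3 are pure noise.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(vim_ms(rf, X, y))
```

Permuting a noise column leaves the margin series nearly unchanged (correlation ≈ 1, importance ≈ 0), while permuting an informative column degrades the margins and drives the correlation down.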
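The Bayes step shared by all RFPE-KED variants can be sketched in the original feature space (the setting corresponding to RFPE-KEDV): fit one kernel density estimator per class to get the class-conditional density p(x | c), weight by the class prior p(c), and normalize to obtain the posterior. The bandwidth, data set, and function name here are illustrative assumptions, and the sketch omits the random-forest dimension reduction of the other variants.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KernelDensity

def kde_posterior(X_train, y_train, X_test, bandwidth=0.5):
    """Posterior P(c | x) via per-class KDE and Bayes' formula."""
    classes = np.unique(y_train)
    log_joint = np.empty((len(X_test), len(classes)))
    for i, c in enumerate(classes):
        Xc = X_train[y_train == c]
        kde = KernelDensity(bandwidth=bandwidth).fit(Xc)
        # log p(x | c) + log p(c)
        log_joint[:, i] = kde.score_samples(X_test) + np.log(len(Xc) / len(X_train))
    # Normalize the joint to a posterior (log-sum-exp for stability).
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

# Low-dimensional toy data; plain KDE degrades in high dimensions,
# which is what motivates the reduced-subspace variants.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]
post = kde_posterior(X_train, y_train, X_test)
print(post[:3])
```

Each row of `post` is a probability distribution over the classes, so it can be used directly as a risk estimate for the prediction rather than a bare class label.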
Keywords/Search Tags:random forest, variable importance, margin, probability, kernel density estimation