
Variable Importance Measure And Kernel Density Estimation Based On Random Forest

Posted on: 2018-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: P Peng
Full Text: PDF
GTID: 2428330515952523
Subject: Control Engineering

Abstract/Summary:
Random forest, an important and widely used data-mining algorithm, offers high classification performance, few parameters, fast and efficient operation, and tolerance to noise. In addition, random forest can estimate variable importance, the out-of-bag (OOB) error, and the proximity between samples. These strengths have made it widely used and studied in many fields. However, the OOB error of the decision trees does not faithfully reflect the generalization performance of the random forest, and it may not change at all when a variable is randomly permuted, especially on high-dimensional data containing many correlated variables. Consequently, the traditional variable importance measure (VIM) of random forest, based on the error of the decision trees, may be unreliable and unstable as a feature selection method. To address these problems, this thesis presents a new variable importance measure based on margin series (VIM-MS), which measures the importance of a variable by the similarity between two margin series, computed before and after randomly permuting that variable. We assess VIM and VIM-MS by combining the stability of the feature selection algorithm with predictive accuracy. Experiments on gene data sets and UCI data sets show that VIM-MS is more stable than VIM, and that this gain does not sacrifice classification performance.

Existing methods for building ensembles of probability estimation trees mainly include leaf frequency, Laplace estimation, and m-estimation. However, the posterior probabilities produced by these methods have been shown to be highly biased and unreliable, and these methods ignore the differences among test examples that fall into the same leaf node. To address these problems, this thesis proposes improved and novel probability estimators that combine random forest with kernel density estimation (RFPE-KED). The class-conditional density function is estimated by kernel density estimation, and the posterior probabilities are then computed by Bayes' formula, which provides risk probabilities for the classification of random forest. To make kernel density estimation applicable to high-dimensional data, we use random forest for dimension reduction: local kernel density estimators are built in the reduced subspaces corresponding to the trees of the random forest (RFPE-KEDI), its nodes (RFPE-KEDII), and its proximity measure (RFPE-KEDIII); for comparison, we also present two kernel density estimators in random subspaces (RFPE-KEDIV) and in the original space (RFPE-KEDV). We use the mean squared error (MSE) to compare ensembles of probability estimation trees with RFPE-KED. Preliminary experiments on synthetic data show that RFPE-KED provides more accurate probability estimates.
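The permutation-and-compare idea behind VIM-MS can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the margin is taken as P(true class) − max P(other class) per sample, and the similarity between the two margin series is assumed to be Pearson correlation (the abstract does not name the metric), so importance is scored as 1 − correlation. All function names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def margin_series(forest, X, y):
    """Per-sample margin: P(true class) - max P(any other class)."""
    proba = forest.predict_proba(X)
    idx = np.arange(len(y))
    true_p = proba[idx, y]
    others = proba.copy()
    others[idx, y] = -np.inf          # mask the true class
    return true_p - others.max(axis=1)

def vim_ms(forest, X, y, seed=0):
    """Importance of each variable = dissimilarity (assumed: 1 - Pearson
    correlation) of the margin series before vs. after permuting it."""
    rng = np.random.default_rng(seed)
    base = margin_series(forest, X, y)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # destroy variable j's signal
        perm = margin_series(forest, Xp, y)
        scores.append(1.0 - np.corrcoef(base, perm)[0, 1])
    return np.array(scores)

# Toy data: with shuffle=False the first 3 columns are informative,
# the last 3 are pure noise.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(vim_ms(rf, X, y))
```

Permuting a noise column leaves the margin series nearly unchanged (correlation ≈ 1, importance ≈ 0), while permuting an informative column degrades the margins and drives the correlation down.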
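The Bayes step shared by all RFPE-KED variants can be sketched in the original feature space (the setting corresponding to RFPE-KEDV): fit one kernel density estimator per class to get the class-conditional density p(x | c), weight by the class prior p(c), and normalize to obtain the posterior. The bandwidth, data set, and function name here are illustrative assumptions, and the sketch omits the random-forest dimension reduction of the other variants.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KernelDensity

def kde_posterior(X_train, y_train, X_test, bandwidth=0.5):
    """Posterior P(c | x) via per-class KDE and Bayes' formula."""
    classes = np.unique(y_train)
    log_joint = np.empty((len(X_test), len(classes)))
    for i, c in enumerate(classes):
        Xc = X_train[y_train == c]
        kde = KernelDensity(bandwidth=bandwidth).fit(Xc)
        # log p(x | c) + log p(c)
        log_joint[:, i] = kde.score_samples(X_test) + np.log(len(Xc) / len(X_train))
    # Normalize the joint to a posterior (log-sum-exp for stability).
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

# Low-dimensional toy data; plain KDE degrades in high dimensions,
# which is what motivates the reduced-subspace variants.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]
post = kde_posterior(X_train, y_train, X_test)
print(post[:3])
```

Each row of `post` is a probability distribution over the classes, so it can be used directly as a risk estimate for the prediction rather than a bare class label.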
Keywords/Search Tags:random forest, variable importance, margin, probability, kernel density estimation