Study On Two-Stage Stepwise Variable Selection Based On Random Forests

Posted on: 2017-05-13 | Degree: Master | Type: Thesis
Country: China | Candidate: P F Feng | Full Text: PDF
GTID: 2180330485967120 | Subject: Statistics
Abstract/Summary:
With the rapid development of automatic data acquisition, the mobile Internet, the Internet of Things, and cloud computing, large volumes of data are accumulating rapidly in economics, biology, and other fields, and these data are increasingly high-dimensional. When the number of variables is much larger than the number of samples, we have the so-called "large P, small N" problem. Random Forests are a commonly used tool for high-dimensional data. The method is efficient, handles nonlinearity, interactions, and correlated predictors well, and usually avoids overfitting. A major feature of Random Forests is their built-in variable importance measures, which can be applied to many kinds of regression and classification problems and have been studied extensively in economics, biology, and other fields. We therefore propose a Two-Stage Stepwise Variable Selection algorithm Based on Random Forests (abbreviated as TSRF). The main results are as follows:

1. A new variable importance measure is proposed. According to the literature, when the data contain many noise variables alongside the relevant variables, the importance scores of the truly relevant variables are affected. We therefore propose an improved variable importance measure based on Random Forests, whose aim is to improve the separation between important variables and noise variables. Simulation experiments on normal data and on biological genetic data show that the proposed method is effective and feasible.

2. A new stepwise variable selection method is proposed. By combining Random Forests with stepwise variable selection, we improve on stepwise variable selection based on Random Forests: the independent variables highly correlated with the dependent variable are selected, and the noise variables are screened out. The new approach is more accurate than traditional variable selection based on Random Forests.

To examine the proposed method, we conducted simulations that verify its feasibility and efficiency on normal data and on biological genetic data. The simulations on normal data are of two types, classification and regression; in them we study the effects of the sample size N, the number of variables P, the correlation coefficient r between variables, and the number of groups on TSRF. For the genetic data, given the chromosome length, the number of chromosomes, the location of the quantitative trait locus (QTL) on the chromosome, and the number of markers, we discuss two cases: a single QTL and multiple QTLs. These simulations verify the feasibility and effectiveness of TSRF.

3. A comparative analysis is given. An empirical study of marker selection on grains-per-panicle data is presented. SCAD-penalized regression and Elastic Net regression are then applied to the same example. In addition, WinQTLcart2.5, a traditional quantitative trait locus mapping software package, is used to analyse the grains-per-panicle data. The results show that TSRF filters variables accurately and represents a substantial improvement.

The above analysis shows that Two-Stage Stepwise Variable Selection Based on Random Forests is of great significance for variable selection in high-dimensional data in economics and biology.
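The abstract does not spell out the TSRF algorithm, so the following is only a minimal sketch of the general idea it describes, under stated assumptions: stage 1 ranks variables by a standard out-of-bag permutation importance (not the improved measure proposed in the thesis), and stage 2 makes one forward stepwise pass in rank order, keeping a variable only if it lowers the out-of-bag error by a relative tolerance. All names (`two_stage_select`, `simulate`, `tol`, and so on) are hypothetical, the toy forest uses median-only splits to stay short, and the data are a made-up regression example in which only variables 0 and 1 matter.

```python
import random

def sse(y, rows):
    """Sum of squared deviations of y from its mean over the given rows."""
    m = sum(y[i] for i in rows) / len(rows)
    return sum((y[i] - m) ** 2 for i in rows)

def build_tree(X, y, rows, feats, mtry, depth, rng):
    """Grow a small regression tree; a leaf is the mean response (a float)."""
    mean = sum(y[i] for i in rows) / len(rows)
    if depth == 0 or len(rows) < 6:
        return mean
    best = None
    for j in rng.sample(feats, mtry):          # random feature subset per split
        vals = sorted(X[i][j] for i in rows)
        thr = vals[len(vals) // 2]             # simplification: median split only
        left = [i for i in rows if X[i][j] < thr]
        right = [i for i in rows if X[i][j] >= thr]
        if left and (best is None or sse(y, left) + sse(y, right) < best[0]):
            best = (sse(y, left) + sse(y, right), j, thr, left, right)
    if best is None:
        return mean
    _, j, thr, left, right = best
    return (j, thr, build_tree(X, y, left, feats, mtry, depth - 1, rng),
            build_tree(X, y, right, feats, mtry, depth - 1, rng))

def predict(tree, x):
    """Walk from the root down to a leaf value."""
    while isinstance(tree, tuple):
        j, thr, left, right = tree
        tree = left if x[j] < thr else right
    return tree

def fit_forest(X, y, feats, n_trees, rng):
    """Bagged trees; each tree is stored with its out-of-bag (OOB) rows."""
    n, mtry, forest = len(X), max(1, len(feats) // 3), []
    while len(forest) < n_trees:
        boot = [rng.randrange(n) for _ in range(n)]
        oob = sorted(set(range(n)) - set(boot))
        if oob:
            forest.append((build_tree(X, y, boot, feats, mtry, 4, rng), oob))
    return forest

def tree_mse(tree, X, y, rows):
    return sum((y[i] - predict(tree, X[i])) ** 2 for i in rows) / len(rows)

def oob_error(forest, X, y):
    """Average per-tree mean squared error on each tree's OOB rows."""
    return sum(tree_mse(t, X, y, oob) for t, oob in forest) / len(forest)

def permutation_importance(forest, X, y, feats, rng):
    """Mean increase in OOB error after shuffling one variable at a time."""
    imp = {j: 0.0 for j in feats}
    for tree, oob in forest:
        base = tree_mse(tree, X, y, oob)
        for j in feats:
            shuffled = [X[i][j] for i in oob]
            rng.shuffle(shuffled)
            Xp = {i: X[i][:j] + [v] + X[i][j + 1:] for i, v in zip(oob, shuffled)}
            imp[j] += (tree_mse(tree, Xp, y, oob) - base) / len(forest)
    return imp

def two_stage_select(X, y, n_trees=30, tol=0.05, seed=0):
    rng = random.Random(seed)
    feats = list(range(len(X[0])))
    # Stage 1: rank all variables by permutation importance.
    forest = fit_forest(X, y, feats, n_trees, rng)
    imp = permutation_importance(forest, X, y, feats, rng)
    ranking = sorted(feats, key=lambda j: -imp[j])
    # Stage 2: forward pass in rank order; keep a variable only if the
    # OOB error drops by more than the relative tolerance `tol`.
    mean_y = sum(y) / len(y)
    best_err = sum((v - mean_y) ** 2 for v in y) / len(y)   # no-variable baseline
    selected = []
    for j in ranking:
        trial = selected + [j]
        err = oob_error(fit_forest(X, y, trial, n_trees, rng), X, y)
        if err < (1 - tol) * best_err:
            selected, best_err = trial, err
    return ranking, selected

def simulate(n, p, seed):
    """Toy regression data: only variables 0 and 1 drive y; the rest are noise."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(n)]
    y = [5 * x[0] - 4 * x[1] + rng.gauss(0, 0.1) for x in X]
    return X, y

if __name__ == "__main__":
    X, y = simulate(n=80, p=10, seed=1)
    ranking, selected = two_stage_select(X, y)
    print("importance ranking:", ranking)
    print("selected variables:", selected)
```

In practice one would likely use an established Random Forests implementation with full split search rather than this toy forest; the sketch is only meant to show how an importance ranking and a stepwise pass can be chained, and the tolerance `tol` is an arbitrary illustrative choice.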
Keywords/Search Tags: Random Forests, Variable Selection, Variable Importance Measures, Regression Analysis