Comparative Study And Application Of Several Variable Selection Methods

Posted on: 2022-12-03
Degree: Master
Type: Thesis
Country: China
Candidate: M K Sun
GTID: 2480306731994639
Subject: Applied Statistics
Abstract/Summary:
In the era of big data, extracting the most valuable information quickly from huge amounts of data is a concern for many scholars and enterprises. When the dimensionality of the data is high, it becomes necessary to select the variables that play important roles in decision-making and related problems. Variable selection can reduce the dimensionality of the data and improve both the performance of learning algorithms and the robustness of the model. Many variable selection methods are now available, and each has its own characteristics. The purpose of this thesis is to analyze four typical, commonly used variable selection methods, compare their performance, and offer suggestions for using them in different situations. First, the thesis describes the basic principles of four variable selectors based on Lasso, random forest, XGBoost, and gradient learning. Then, using simulated data that contain both important and non-important variables at different sample sizes, it compares the four methods in terms of how many important variables they select and how accurately they select them. Comparing the selection frequencies and running times shows that the methods based on Lasso and XGBoost have poor stability, while the methods based on random forest and gradient learning are more stable. In terms of accuracy, the methods based on random forest and gradient learning perform best and the Lasso-based method performs worst. In terms of running time, the Lasso-based method is the fastest on average, and the methods based on random forest and gradient learning take longer. Furthermore, after repeated experiments with the different variable selection methods on three real datasets, an XGBoost regression model and a random forest regression model are compared using the variables that each method selected most frequently across runs. The conclusion is that the models built after variable selection perform better in some cases. The model built on the gradient-learning selector achieved the best results on dataset 1 but underperformed on the other two, which indicates that different variable selection methods can behave very differently on different data, and that several methods can be compared in order to build the best model. Finally, the conclusions and directions for future work are given.
Keywords/Search Tags:Variable selection, Lasso, Random forest, XGBoost, Gradient learning
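The simulation design described in the abstract (data with known important and non-important variables, repeated runs, and a count of how often each variable is selected) can be illustrated with a minimal sketch. This is not the thesis's code: the function names, the coefficient value 2.0, the penalty `lam`, and the plain coordinate-descent Lasso are all assumptions chosen for illustration, and only the Lasso-based selector is shown.

```python
import random


def simulate(n, p, n_important, rng, noise=1.0):
    """Hypothetical generator: only the first n_important columns affect y."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(2.0 * row[j] for j in range(n_important)) + rng.gauss(0.0, noise)
         for row in X]
    return X, y


def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    No intercept is fitted; this assumes roughly zero-mean data, which holds
    for the simulated data above.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def selection_frequency(n, p, n_important, lam, n_repeats, seed=0):
    """Fraction of repeated simulations in which each variable is selected."""
    rng = random.Random(seed)
    counts = [0] * p
    for _ in range(n_repeats):
        X, y = simulate(n, p, n_important, rng)
        beta = lasso_cd(X, y, lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return [c / n_repeats for c in counts]


if __name__ == "__main__":
    freq = selection_frequency(n=100, p=8, n_important=3, lam=0.3, n_repeats=20)
    for j, f in enumerate(freq):
        print(f"variable {j}: selected in {f:.0%} of runs")
```

A selector with good stability and accuracy would show frequencies close to 1 for the important variables and close to 0 for the rest. The same frequency count could be applied to any other selector by swapping out `lasso_cd`.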