Comparative Study And Application Of Several Variable Selection Methods

Posted on:2022-12-03

Degree:Master

Type:Thesis

Country:China

Candidate:M K Sun

Full Text:PDF

GTID:2480306731994639

Subject:Applied Statistics

Abstract/Summary:

In the era of big data, extracting the most valuable information quickly from massive datasets is a concern for many scholars and enterprises. When the data are high-dimensional, it becomes necessary to select the variables that play an important role in decision-making and related problems. Variable selection can reduce the dimensionality of the data and improve both the performance of learning algorithms and the robustness of the model. Thanks to extensive research, many variable selection methods are now available, but each has different characteristics. The purpose of this thesis is to analyze four typical, commonly used variable selection methods, compare their performance, and offer reference suggestions for using them in different situations.

First, the thesis describes the basic principles of four variable selectors based on Lasso, random forest, XGBoost, and gradient learning. Then, using simulated data that contain both important and unimportant variables at different sample sizes, it compares the four methods in terms of the number of variables they select and how accurately they identify the important ones. Analysis of the selection frequencies and running times shows that the methods based on Lasso and XGBoost have poor stability, whereas the methods based on random forest and gradient learning are more stable. In terms of accuracy, the methods based on random forest and gradient learning perform best and the Lasso-based method performs worst. In terms of computational cost, the Lasso-based method has the shortest average running time, while the methods based on random forest and gradient learning take longer.

Furthermore, after repeated experiments with the different variable selection methods on three real datasets, XGBoost and random forest regression models are compared using the variables selected by each method over multiple runs, and it is concluded that the models built after variable selection perform better in some cases. The model built on the variables chosen by the gradient learning method achieves the best results on dataset 1 but underperforms on the other two datasets. This indicates that different variable selection methods can behave very differently on different data, and that several methods can be tried and compared in order to build the best model. Finally, the conclusions and directions for future work are given.
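
The simulation design described in the abstract (data with known important and unimportant variables, repeated runs, and a count of how often each variable is selected) can be illustrated with a minimal sketch. This is not the thesis's actual experiment: the linear data-generating model, the coefficient value 2.0, the sample size, the penalty `lam`, and all function names are assumptions chosen for illustration, and only a Lasso selector is shown, implemented here by plain cyclic coordinate descent with NumPy.

```python
import numpy as np


def simulate(n, p, n_important, rng, noise=1.0):
    """Assumed linear model: only the first n_important columns affect y."""
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:n_important] = 2.0  # illustrative effect size, not from the thesis
    y = X @ beta + noise * rng.standard_normal(n)
    return X, y


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    X = X - X.mean(axis=0)  # centre so that no intercept is needed
    y = y - y.mean()
    col_sq = (X ** 2).sum(axis=0) / n
    b = np.zeros(p)
    resid = y.copy()  # residual y - Xb for the current b
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (b[j] - new_bj)
            b[j] = new_bj
    return b


def selection_frequency(n, p, n_important, lam, n_reps=50, seed=0):
    """Fraction of repeated simulations in which each variable is selected."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(p)
    for _ in range(n_reps):
        X, y = simulate(n, p, n_important, rng)
        counts += lasso_cd(X, y, lam) != 0  # nonzero coefficient = selected
    return counts / n_reps


freq = selection_frequency(n=200, p=20, n_important=5, lam=0.1)
print(np.round(freq, 2))
```

The first five entries of `freq` correspond to the important variables and the remaining entries to the unimportant ones. Frequencies close to 0 or 1 suggest a stable selector, while intermediate values suggest that the selected set changes from run to run. A comparison in the spirit of the thesis would repeat the same loop with selectors based on random forest, XGBoost, and gradient learning.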

Keywords/Search Tags:

Variable selection, Lasso, Random forest, XGBoost, Gradient learning

Related items

1	Research On The Advantages And Disadvantages Of Lasso And Its Improved Methods In Variable Selection
2	Comparison And Analysis Of Variable Selection Methods In Classical Statistics And Machine Learning
3	Meta Analysis Based On Random Lasso
4	Application Of Artificial Intelligence In Navigation Positioning
5	Comparison Of Several Methods For Generating Directed Acyclic Graph By Variable Selection
6	Research On Rainfall In Fujian Province Based On Random Forest Algorithm
7	Random Lasso Method In Logistic Regression
8	Comparative Study And Empirical Analysis Of Lasso Type Variable Selection Methods
9	Study On The Applications Of Random Lasso In Logistic Model
10	Summary Of Lasso Variable Selection Methods