Comparative Study And Empirical Analysis Of Lasso Type Variable Selection Methods

Posted on: 2022-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: D K Hou
GTID: 2480306311464074
Subject: Statistics
Abstract/Summary:
All statistical models are approximations of reality. Model selection is among the most important problems in statistics, because no regression model suits every kind of data set. For each data set, we need to determine which method fits an appropriate model for prediction and inference. The comparative analysis of fitting methods has therefore become an active topic in statistics, and many statisticians have published papers examining the problem from different angles.

Variable selection is a crucial part of modeling. To achieve strong inference and prediction ability, a good model should identify, among the many covariates introduced at the start of modeling, those most closely related to the response variable. The model should also be stable: the results of variable selection should not be unduly affected by random noise or contaminated data.

Longitudinal data consist of repeated observations of different individuals at different time points. They appear frequently in biomedicine, clinical trials, meteorological observation, and other fields, and are an important data type in statistics. Depending on the background of the data and the starting point for solving the problem, the methods for handling such data go by different names, such as random effect model, hierarchical linear model, and variance component model.

The main work and characteristics of this paper are as follows. (1) To compare linear random effect models, a multivariate normal data set suited to numerical simulation is constructed. The mean of the normal data set is itself designed to be a normally distributed random variable (simulating the internal fluctuation of individual predictors, the heterogeneity between individuals, the error of each measurement, etc.), so the data used for fitting in the numerical simulation experiments, which are repeated 300 times, are longitudinal data and the fitted model has a random effect. (2) The linear random effect models based on four common variable selection methods (Lasso, Elastic-Net, Adaptive-Lasso, and SCAD) are compared in terms of consistency of variable selection and accuracy and stability of model prediction. To evaluate the consistency of variable selection, the analysis does not use the false discovery rate and false exclusion rate adopted in most related articles. Instead, it computes the angle between the normalized estimated coefficient vector $\hat{\beta}$ and the normalized true coefficient vector $\beta$: $\mathrm{Angle} = \frac{180}{\pi}\arccos(\hat{\beta}^{T}\beta)$. With the help of the statistical analysis toolbox, the distribution and stability of the Angle (i.e. the number of outliers) over the 300 numerical simulations are described.

The paper is divided into two parts. The theoretical part summarizes the basic theory of the penalty functions involved in this paper, covering the principles, characteristics, and relationships of their variable selection, and the key parts are demonstrated. In the numerical simulation and empirical part, we first design data sets that meet the requirements of the numerical simulation experiments and simulate three common application scenarios: (1) $n \gg p$, where the number of observations is much larger than the number of predictor variables; (2) high-dimensional data that satisfy $\log(p) = n^{a}$ $(0 < a < 1)$; (3) high-dimensional data in which an outlier is added to some element $X_{ij}$ of the predictor matrix $X_{n \times p}$, simulating the contamination that can occur during data collection. Unless stated otherwise, all models discussed in this paper are assumed to be sparse (that is, $\beta$ has at most $k \ll p$ nonzero terms).

Then, four different variable selection methods are used to establish linear random effect models. We want to: (1) investigate the performance and characteristics of the fitted models in prediction accuracy, coefficient consistency, and stability as the multicollinearity between predictor variables (measured by the correlation coefficient) increases from small to large, and on that basis propose suggestions about which model fitting methods suit which application scenarios; (2) investigate how the performance and characteristics of the fitted models differ when the collected data are contaminated. Finally, through empirical analysis, we show the application and characteristics of the four variable selection methods discussed in this paper.
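The angle criterion described in the abstract, Angle = (180/π)·arccos of the inner product of the normalized estimated and true coefficient vectors, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's own code: the function name `selection_angle`, the use of NumPy, the choice to return NaN when either vector is all zeros, and the example vectors are all introduced here for illustration.

```python
import numpy as np


def selection_angle(beta_hat, beta_true):
    """Angle in degrees between two coefficient vectors after normalization.

    Hypothetical helper: an all-zero vector has no direction, so NaN is
    returned in that case (an assumption, not taken from the thesis).
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)

    norm_hat = np.linalg.norm(beta_hat)
    norm_true = np.linalg.norm(beta_true)
    if norm_hat == 0.0 or norm_true == 0.0:
        return float("nan")

    # Inner product of the unit-length vectors is the cosine of the angle.
    cosine = np.dot(beta_hat / norm_hat, beta_true / norm_true)
    # Guard against rounding slightly outside [-1, 1] before arccos.
    cosine = np.clip(cosine, -1.0, 1.0)
    return float(np.degrees(np.arccos(cosine)))


if __name__ == "__main__":
    # Made-up sparse "true" vector and a made-up estimate, for illustration only.
    beta_true = [3.0, 1.5, 0.0, 0.0, 2.0]
    beta_hat = [2.8, 1.2, 0.1, 0.0, 1.9]
    print(selection_angle(beta_hat, beta_true))
```

An angle near 0 degrees means the estimated coefficients point in nearly the same direction as the true ones. Collecting this value over repeated simulation runs gives the distribution whose spread and outliers the abstract refers to.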
Keywords/Search Tags:linear random effect model, variable selection, model prediction accuracy, coefficient consistency, boxplot, stability