| Variable selection has always been a central topic in statistical research. With the development of data acquisition technology, high-dimensional and complex data have emerged in various research fields, bringing new challenges to variable selection. Traditional methods such as all subset regression must search through all candidate models to select the globally optimal subset, which becomes an NP-hard problem as the number of variables increases. Variable selection methods based on penalty functions, such as LASSO (Least Absolute Shrinkage and Selection Operator; Tibshirani, 1996), SCAD (Smoothly Clipped Absolute Deviation; Fan and Li, 2001), ALASSO (Adaptive LASSO; Zou, 2006), and recently proposed methods such as ABESS (Adaptive Best-Subset Selection; Zhu et al., 2020), have overcome the dimensional limitation and poor stability of full subset selection (Niu Yong et al., 2021). However, they are sensitive to strong correlations between variables, and when the number of variables is large, the accuracy of variable selection decreases significantly. In addition, many methods are proposed only for linear models or specific model structures, and applying them to more complex models requires strong technical skills. Therefore, variable selection for high-dimensional complex data still faces three challenges: statistical accuracy, algorithmic stability, and model interpretability (Fan et al., 2009). Qian and Field (2002) proposed a best subset random search method based on Markov Chain Monte Carlo (MCMC) for logistic regression models. This random search method can take into account the joint information of different variable combinations during the model search process, and has advantages when there are complex correlations between variables. However, Qian and Field (2002) and subsequent work considered only the case p < n, and research on the high-dimensional case p ≥ n has not yet been carried out. Moreover, random search methods for variable selection in
linear models and linear mixed-effects models have not yet received research attention. Given the special advantages of the random search method in variable selection, in this article we study the random search method for variable selection in linear regression models and linear mixed-effects models in both the p < n and p ≥ n situations, aiming to improve the accuracy of variable selection compared with some existing methods. Four specific aspects of work have been carried out: (1) a Bayesian Information Criterion (BIC) induced stochastic search method is proposed for variable selection in linear models, and the algorithm for implementing the method is presented. The accuracy of variable selection is compared with the ABESS method proposed by Zhu et al. (2020) and the traditional All Subset Regression (ASR) method through simulation experiments. The effectiveness of the stochastic search method in selecting variables is discussed; (2) a random search strategy that iteratively updates the “best subset” and the variance parameters is proposed for linear model variable selection in high-dimensional situations, and the algorithm for applying the method is given. The effectiveness of the random search method is discussed in comparison with the ABESS method and the LASSO method through simulation experiments; (3) the random search method for variable selection in linear mixed-effects models is studied. The random search methods for selecting fixed-effects variables when the random-effects variables are given, selecting random-effects variables when the fixed-effects variables are given, and selecting variables from both parts are studied separately. The algorithm for implementing the method is provided. The accuracy of variable selection is compared with the method proposed by Bondell et al. (2010) and the method proposed by Fan and Li (2012) through simulation experiments, and the effectiveness of the random search method is discussed; and (4) an iterative stochastic search method
is proposed, based on updating both “best subsets”, to select fixed-effects and random-effects variables simultaneously. The algorithm for implementing the method is provided. Through simulation experiments, the effectiveness of the random search method for variable selection in high-dimensional situations is discussed. The main results of the research are as follows: (1) the BIC induced random search methods proposed in this article can effectively select important variables for linear models. When variables are uncorrelated, the stochastic search method performs as well as both the ABESS method and the LASSO method. Furthermore, the accuracy of variable selection by the random search method is significantly higher than that of the ABESS method and the LASSO method when variables are complexly correlated. Therefore, in situations where there are complex correlations among variables, the variables selected by the random search method are more reliable than those selected by the ABESS method and the LASSO method; (2) for variable selection in linear mixed-effects models, the random search method can effectively select the important variables in both the fixed-effects and the random-effects parts simultaneously. In the same simulation scenario, the accuracy of variable selection using the random search method is higher than that of the method proposed by Bondell et al. (2010). Compared with the method proposed by Fan and Li (2012), the random search method can better balance the accuracy of variable selection across the two parts, with the average accuracy over both parts being superior to that of the Fan and Li (2012) method; and (3) for variable selection in linear mixed-effects models in high-dimensional settings, the random search method proposed in this article can also effectively select significant variables for both the fixed-effects and random-effects parts simultaneously. Compared with low-dimensional settings, the accuracy of variable selection using the
random search method does not decrease significantly as the number of variables increases under the same conditions. Therefore, in high-dimensional settings, the two parts of variables selected by the random search method remain highly reliable. The innovation of our work lies in the following: (1) unlike Qian and Zhao’s (2007) variable selection method based on a probability threshold, we introduce a best subset selection method based on a nested model series sorted by variable importance, which improves the efficiency of the model search and can be directly applied to variable selection in high-dimensional settings; (2) to address the problem that the best subset cannot be searched for directly among all subsets when the number of variables exceeds the sample size in high-dimensional settings, we build an iterative algorithm for variable selection that iteratively updates the variance parameters and the “best subset”; (3) the random search method takes into account the joint information of variable combinations during the model search process, and preserves the correlation information between variables when they are complexly correlated. Therefore, our work provides a reliable methodological supplement for variable selection in complex correlated data; and (4) compared with some existing methods for variable selection in linear mixed-effects models, the random search method can better balance the accuracy of both fixed-effects and random-effects variables, resulting in more reliable variable selection. Therefore, our research provides an effective methodological supplement and new research ideas for variable selection in linear mixed-effects models. |
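
To make the idea of a BIC induced stochastic search more concrete, the following is a minimal sketch for a linear model with p < n. It is not the algorithm developed in this work: it assumes a plain Metropolis sampler over variable subsets whose target is proportional to exp(-BIC/2), with single-variable flips as proposals. The function names `bic` and `stochastic_search`, the iteration count, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def bic(X, y, subset):
    """BIC of the least-squares fit on the columns in `subset` (intercept always included)."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X[:, subset]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    return n * np.log(rss / n) + Z.shape[1] * np.log(n)

def stochastic_search(X, y, n_iter=2000, seed=0):
    """Random walk over subsets; returns the lowest-BIC subset visited and its BIC.

    Assumption: p < n, so every candidate subset has a positive residual sum of squares.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    current = np.zeros(p, dtype=bool)          # start from the intercept-only model
    cur_bic = bic(X, y, np.flatnonzero(current))
    best, best_bic = current.copy(), cur_bic
    for _ in range(n_iter):
        j = rng.integers(p)                    # propose flipping one variable in or out
        proposal = current.copy()
        proposal[j] = ~proposal[j]
        prop_bic = bic(X, y, np.flatnonzero(proposal))
        # Metropolis rule for a target proportional to exp(-BIC / 2)
        if np.log(rng.random()) < -(prop_bic - cur_bic) / 2:
            current, cur_bic = proposal, prop_bic
            if cur_bic < best_bic:
                best, best_bic = current.copy(), cur_bic
    return np.flatnonzero(best), best_bic

# Illustrative simulated data: variables 0, 3 and 5 carry the signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + 2 * X[:, 5] + rng.normal(size=200)
selected, score = stochastic_search(X, y)
print(selected, score)
```

Because each proposal is scored on the whole subset rather than on one coefficient at a time, the search uses the joint information of variable combinations, which is the property the abstract highlights for correlated predictors. The high-dimensional case p ≥ n would need the iterative updating of variance parameters described above and is not covered by this sketch.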