| The main purpose of statistical machine learning is to build a predictive model from training data that captures the statistical regularities of the data, and to use that model to predict new data. Building and selecting models are the key issues. Model selection means choosing the best model by estimating the performance of the candidate models. In traditional statistical regression analysis, model selection refers to variable selection, which has been a key research topic in statistics since the 1960s. In classification, model selection mainly covers two aspects: on the one hand, the selection of the classifier (classification algorithm), that is, choosing the best of several classifiers for a given dataset according to some performance metric; on the other hand, the selection of features (variables), that is, choosing the feature combination with the best performance. In the existing literature, regression and classification models are often selected directly on the basis of an estimate of the generalization error; cross-validation estimates of the generalization error, for example, are widely used for this purpose. However, methods based on these estimates use only the estimate itself (the mean) when selecting a model and ignore the variance of the estimate. A large variance causes large fluctuations in the selected model and biases the selection toward more complex models, which lowers the generalization performance of the selected model. This paper therefore adds the variance to the traditional regression and classification model selection criteria as a regularization term and proposes a new variance-regularized model selection criterion under the cross-validation framework. First, simulation experiments verify the importance of the variance regularization term in model selection. Then, extensive experiments on simulated and real data verify that, in both regression and classification tasks, the proposed variance-regularized criterion selects a simpler model with smaller generalization error than traditional model selection methods. Furthermore, it is proved theoretically that the proposed criterion based on variance-regularized cross-validation is consistent in selection, that is, the optimal model selected under finite samples is also optimal as the sample size tends to infinity, which ensures the stability of model selection. |
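
The idea of penalizing the variance of a cross-validation estimate can be illustrated with a minimal sketch. The abstract does not state the exact form of the criterion, so the score used here (mean of the per-fold errors plus `lam` times their sample variance) is an assumption, as are the polynomial-regression candidates, the weight `lam`, and the helper names `kfold_errors` and `select_model`, which are hypothetical and not taken from the paper.

```python
import numpy as np


def kfold_errors(x, y, degree, k=5, seed=0):
    """Return the k per-fold test MSEs of a polynomial fit of the given degree."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[test])
        errors.append(np.mean((y[test] - pred) ** 2))
    return np.array(errors)


def select_model(x, y, degrees, lam=1.0, k=5):
    """Pick the degree minimizing mean(CV error) + lam * var(CV error).

    Assumed form of a variance-regularized criterion; lam = 0 recovers
    ordinary cross-validation selection based on the mean alone.
    """
    scores = {}
    for d in degrees:
        e = kfold_errors(x, y, d, k)
        scores[d] = e.mean() + lam * e.var(ddof=1)
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data from a linear model with Gaussian noise (illustrative only).
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, 60)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, 60)

best_plain, _ = select_model(x, y, range(1, 9), lam=0.0)
best_reg, _ = select_model(x, y, range(1, 9), lam=1.0)
print("mean-only CV picks degree", best_plain)
print("variance-regularized CV picks degree", best_reg)
```

Setting `lam=0.0` gives the conventional mean-only selection, so the two calls can be compared directly on the same folds; how `lam` should be chosen is not addressed by this sketch.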