With the development of big data, data collection methods have diversified, and the data structures used for modeling and analysis have become increasingly varied. When the response variable is costly or difficult to collect, only a small fraction of the samples in a data set are labeled and the majority are unlabeled. The sample size available for supervised learning is then very small, and most of the unlabeled samples go unused. This motivates us to study high-dimensional semi-supervised data and to extend existing research results to this new data context. In this paper, we consider this situation and improve supervised learning models by exploiting information from unlabeled samples.

Fitted models that describe the data-generating mechanism are obtained by machine learning and are widely used in economic analysis, biomedicine, text and image analysis, and other fields. In applications, the main concern is the predictive power of the model. There have therefore been many studies on model selection, that is, choosing the optimal model based on its predictive performance. However, the single model chosen by model selection is subject to uncertainty and risks producing undesirable results. To reduce the uncertainty introduced by model selection and to improve prediction performance, some scholars have proposed model averaging methods. As data dimensionality has grown, many screening methods and model averaging methods have been developed for high-dimensional data. However, existing studies analyze complete cases (i.e., labeled data only), and less attention has been paid to settings with a large amount of unlabeled data. In particular, when the collected data contain a large amount of unlabeled data and the labeled data are insufficient, the predictive performance of existing methods can degrade. How to use the information in unlabeled data to improve the prediction performance of existing model averaging methods is an issue worth studying.

The main work of this paper is to develop a prediction method based on sequential model averaging for robust prediction with high-dimensional semi-supervised data. The method has two steps. First, univariate model averaging is performed using the semi-supervised samples (both labeled and unlabeled): the weights of the candidate models are determined by the extended BIC criterion, and the regression coefficients of the candidate models are estimated from the semi-supervised data. Second, sequential model averaging is performed, with each step replacing the response variable with the residuals obtained from the previous regression step.

This paper makes three contributions. First, most existing semi-supervised learning methods address classification, and there are relatively few studies of semi-supervised regression. The proposed method uses the information in unlabeled samples for regression prediction, i.e., it applies when the response variable is continuous. Second, through univariate model averaging, the method determines each candidate model and its weight in a low-dimensional framework, which ensures computational feasibility for high-dimensional (even ultra-high-dimensional) regression. Third, by applying a sequential screening procedure to univariate model averaging, the method can effectively adjust the weights and avoid overfitting in the model averaging stage. Simulation experiments compare the proposed method with commonly used model selection and model averaging methods for high-dimensional regression, including settings with outliers and model misspecification, and the proposed method shows more robust prediction performance.
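
The two-step procedure described above can be illustrated with a rough sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration and not the paper's actual estimator: the function names `fit_sequential_averaging` and `predict`, the fixed number of steps `n_steps`, the choice to pool labeled and unlabeled rows only when estimating predictor means and variances, and the exponential BIC-type weights. Because every candidate model in this sketch has exactly one predictor, the model-size penalty of the extended BIC is the same for all candidates and cancels when the weights are normalized.

```python
import math


def fit_sequential_averaging(x_lab, y, x_unlab, n_steps=5):
    """Hypothetical sketch of sequential univariate model averaging.

    x_lab:   list of labeled predictor rows
    y:       list of responses for the labeled rows
    x_unlab: list of unlabeled predictor rows (no responses)
    """
    n, p = len(x_lab), len(x_lab[0])
    x_all = x_lab + x_unlab
    m = len(x_all)

    # Assumption: predictor means and variances use labeled AND unlabeled rows.
    mean = [sum(row[j] for row in x_all) / m for j in range(p)]
    var = [max(sum((row[j] - mean[j]) ** 2 for row in x_all) / m, 1e-12)
           for j in range(p)]

    # Centered labeled predictors and centered response.
    z = [[row[j] - mean[j] for j in range(p)] for row in x_lab]
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    coef = [0.0] * p

    for _ in range(n_steps):
        # Univariate slope of the current residual on each predictor.
        beta = [sum(z[i][j] * resid[i] for i in range(n)) / (n * var[j])
                for j in range(p)]

        # BIC-type score of each one-predictor candidate on the labeled rows.
        score = []
        for j in range(p):
            rss = sum((resid[i] - beta[j] * z[i][j]) ** 2 for i in range(n))
            score.append(n * math.log(max(rss / n, 1e-12)))

        # Normalized weights; subtracting the minimum avoids overflow.
        best = min(score)
        raw = [math.exp(-0.5 * (s - best)) for s in score]
        total = sum(raw)
        step = [beta[j] * raw[j] / total for j in range(p)]

        # Accumulate the averaged coefficients, then regress the next step
        # on what the averaged model left unexplained.
        coef = [c + s for c, s in zip(coef, step)]
        resid = [resid[i] - sum(z[i][j] * step[j] for j in range(p))
                 for i in range(n)]

    return intercept, mean, coef


def predict(model, rows):
    """Predict responses for new predictor rows from a fitted sketch model."""
    intercept, mean, coef = model
    return [intercept + sum(c * (v - mu) for c, v, mu in zip(coef, row, mean))
            for row in rows]
```

Calling `fit_sequential_averaging(x_lab, y, x_unlab)` and passing the result to `predict` gives predictions for new rows. Each pass only fits one-predictor regressions, which is what keeps the cost linear in the number of predictors in this sketch.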