
Model Selection Method Based On Block-regularized Cross-validation

Posted on: 2022-03-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Yang
GTID: 1488306509966419
Subject: Computer Science and Technology
Abstract/Summary:
Statistical machine learning aims to make predictions on new data records using a prediction model constructed from a training dataset. Model selection is a key step in this construction process. This study addresses two important subtasks of model selection: ensemble feature selection based on data-driven cross-validation (CV), and tuning parameter selection. The main results for the two subtasks are as follows.

1. A majority-voting ensemble feature selection method based on a block-regularized CV (m×2 BCV). Many recent studies have shown that the performance of a feature selection method can be improved by applying majority voting to the multiple feature selection results produced within a CV. Analyzing this performance relies on describing the distribution of the estimated feature selection probability. However, existing studies typically impose a strong independence assumption on the multiple feature selection results within a CV and approximate the distribution of the estimated selection probability by a binomial distribution; this approximation diverges substantially from the true distribution. Moreover, the CV methods commonly used for feature selection neglect the randomness introduced by data partitioning, which degrades feature selection performance. In contrast, an m×2 BCV regularizes the number of overlapping samples among training sets and suppresses the randomness of the data partitioning, thereby reducing the variance of the generalization error estimate. In particular, an m×2 BCV has both inter-group and intra-group correlation coefficients, which allow the correlation among the multiple feature selection results to be formulated accurately. This study therefore proposes an ensemble feature selection method based on the m×2 BCV (EFSBCV). EFSBCV not only accounts for the correlation among the multiple feature selection results, but also uses a beta distribution to represent the distribution of the estimated selection probability more accurately. Furthermore, a mild theoretical condition on the correlation is developed that guarantees EFSBCV selects relevant features with high probability and high accuracy, and a bound on the selection probability is given in terms of the repetition count m. Extensive experiments illustrate that the beta distribution represents the distribution of the estimated selection probability better, and comparisons with the earlier StabSel (stability selection) and CPSS (complementary pairs stability selection) methods show that EFSBCV selects relevant features more effectively, while the three methods perform equivalently in eliminating irrelevant features.
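To make the voting scheme above concrete, here is a minimal sketch, not the thesis's implementation: scikit-learn has no block-regularized splitter, so plain repeated 2-fold splits stand in for the true m×2 BCV construction (which additionally balances the overlap among training sets), a Lasso serves as the base selector, and the beta-distribution analysis of the selection probability is replaced by a simple voting threshold. The function name and parameters are illustrative.

```python
# Sketch of majority-voting ensemble feature selection over m repetitions
# of 2-fold CV. NOTE: RepeatedKFold does NOT regularize training-set
# overlaps; it is a stand-in for the m x 2 BCV partitioning.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import RepeatedKFold

def ensemble_feature_selection(X, y, m=3, vote_threshold=0.5, seed=0):
    """Return indices of features whose estimated selection probability,
    pooled over the 2*m training sets, exceeds the threshold."""
    n_features = X.shape[1]
    votes = np.zeros(n_features)
    splitter = RepeatedKFold(n_splits=2, n_repeats=m, random_state=seed)
    n_fits = 0
    for train_idx, _ in splitter.split(X):
        # Base selector: a feature is "selected" if its Lasso
        # coefficient is nonzero on this training half.
        lasso = LassoCV(cv=5).fit(X[train_idx], y[train_idx])
        votes += (np.abs(lasso.coef_) > 1e-10)
        n_fits += 1
    selection_prob = votes / n_fits  # point estimate of selection probability
    return np.where(selection_prob > vote_threshold)[0], selection_prob

if __name__ == "__main__":
    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=200, n_features=50,
                           n_informative=5, noise=1.0, random_state=0)
    selected, probs = ensemble_feature_selection(X, y, m=3)
    print("selected features:", selected)
```

In the thesis, the distribution of `selection_prob` is modeled by a beta distribution that accounts for the inter- and intra-group correlations of the m×2 BCV; the fixed threshold here is only a placeholder for that analysis.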
2. A novel tuning parameter selection method based on the m×2 BCV. Conventional CV-based tuning parameter selection methods must satisfy a harsh theoretical condition in the high-dimensional regression setting to ensure selection consistency. In contrast, the proposed m×2 BCV-based method requires only a mild and general consistency condition, which does not restrict the sizes of the training and validation sets or the repetition count. Furthermore, extensive experiments were conducted on commonly used high-dimensional regression models, including the Lasso (least absolute shrinkage and selection operator), MCP (minimax concave penalty), and SCAD (smoothly clipped absolute deviation). The proposed m×2 BCV-based tuning parameter selection was compared experimentally with selection methods based on other commonly used CV schemes, such as hold-out, 5-fold, and 10-fold CV; the results show that the proposed method outperforms the others.

In conclusion, the m×2 BCV is superior to other CV methods for model selection in the statistical machine learning domain.
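To make the tuning parameter selection loop of part 2 concrete, here is a minimal sketch under the same assumptions as above: plain repeated 2-fold splits stand in for the m×2 BCV, the Lasso is the example model, and the candidate grid, repetition count, and function name (`select_lambda`) are hypothetical.

```python
# Sketch of CV-based tuning parameter selection for the Lasso: pick the
# penalty minimizing the validation error averaged over 2*m splits.
# RepeatedKFold again stands in for the m x 2 BCV partitioning.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold

def select_lambda(X, y, lambdas, m=3, seed=0):
    """Return the penalty level with the smallest CV error estimate,
    together with the full error curve over the candidate grid."""
    splitter = RepeatedKFold(n_splits=2, n_repeats=m, random_state=seed)
    splits = list(splitter.split(X))
    cv_error = np.zeros(len(lambdas))
    for i, lam in enumerate(lambdas):
        for train_idx, val_idx in splits:
            model = Lasso(alpha=lam).fit(X[train_idx], y[train_idx])
            resid = y[val_idx] - model.predict(X[val_idx])
            cv_error[i] += np.mean(resid ** 2)  # validation MSE
    cv_error /= len(splits)
    return lambdas[int(np.argmin(cv_error))], cv_error

if __name__ == "__main__":
    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=200, n_features=50,
                           n_informative=5, noise=1.0, random_state=0)
    best_lam, curve = select_lambda(X, y, np.logspace(-3, 1, 30), m=3)
    print("selected penalty:", best_lam)
```

Averaging the validation error over all 2m splits mirrors the role the m×2 BCV estimate plays in the thesis; the actual block-regularized partitions would further reduce the variance of this estimate.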
Keywords/Search Tags: Model selection, Block-regularized cross-validation, Ensemble feature selection, Tuning parameter selection, Model evaluation criteria