
Research On Block-regularized Cross-Validation Methods For Comparing Supervised Algorithms

Posted on: 2020-10-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R B Wang
Full Text: PDF
GTID: 1368330578972952
Subject: Software engineering
Abstract:
A machine learning model is an essential component of data-driven intelligent systems and is usually derived by applying a learning algorithm to a massive data set. Upgrading an intelligent system increasingly requires a sound algorithm comparison method to select the best-performing algorithm. In fact, algorithm comparison is the fundamental task of deciding whether a novel algorithm is significantly better than a baseline algorithm, and it relies on a well-founded statistical significance test to draw a reliable conclusion. Algorithm comparison methods are used in the major phases of the modeling process, including algorithm selection, feature selection, and model selection and evaluation, so algorithm comparison is central to the modeling process. This study concentrates on comparing two supervised algorithms.

Algorithm comparison is commonly posed as follows: given two learning algorithms and a data set, which algorithm will produce more accurate models when trained and validated on data sets of the same size as the given data set? The question is usually formulated as a hypothesis testing problem and addressed with a significance test built on a well-designed cross-validation (CV). The t-tests based on five-fold (ten-fold) CV are often preferred for their simplicity. However, because these tests adopt downward-biased variance estimators, their type I error is uncontrollable, and they tend to draw false positive conclusions. Although this drawback is alleviated by the 5×2 CV t-test and F-test, those tests still yield unreliable conclusions because of the random partitioning of 5×2 CV. Therefore, for algorithm comparison on an IID data set, this study optimizes the data partitioning and proposes block-regularized CV (BCV) methods. A sound variance estimator of the BCV estimator is adopted, and a novel sequential test is constructed. Theoretical and experimental analyses show that the sequential test has a reasonable type I error and yields reliable conclusions. Furthermore, BCV is generalized to text data sets through regularization conditions on the discrete distribution of prediction labels. Moreover, based on BCV, the posterior distributions of commonly used evaluation metrics, including precision, recall, and the F1 measure, are derived, and a Bayes test based on these posterior distributions is proposed for algorithm comparison.

This study develops BCV methods and investigates their theoretical properties and construction. Specifically, for the repeated learning-testing (RLT) estimator of the generalization error, poor data partitioning may cause large overlaps between the training sets of RLT and inflate the variance of the RLT estimator. By introducing a regularization condition on the number of overlapping samples, the data partitioning of RLT is optimized to reduce this variance, yielding block-regularized RLT (BRLT); several efficient construction methods for BRLT are also proposed. Moreover, as a special case of RLT, m×2 CV is widely used in algorithm comparison, so the optimization of data partitioning in m×2 CV is considered further. Specifically, the effect of the number of overlapping samples on the variance of the m×2 CV estimator of the generalization error is analyzed, and a regularization condition constrains the number of overlapping samples to n/4, where n is the data set size. The resulting m×2 BCV reduces the variance of the estimator, and an efficient nested construction of m×2 BCV is developed.
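To make the block-regularized construction concrete, the sketch below builds the m = 3 case from four equal blocks, so that any two training sets drawn from different replications overlap in exactly n/4 samples. This is a minimal illustration; the function name and block-pairing scheme are illustrative, and the dissertation's nested construction generalizes the idea to larger m.

```python
import numpy as np

def blocked_3x2_partitions(n, seed=0):
    """Construct 3 two-fold partitions of n samples (n divisible by 4)
    such that any two training sets from different replications overlap
    in exactly n/4 samples (the block-regularized condition).
    Illustrative sketch for m = 3 only."""
    assert n % 4 == 0, "n must be divisible by 4 for equal blocks"
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    b = np.array_split(idx, 4)  # four equal blocks B1..B4
    # Each replication pairs the blocks differently, so any two training
    # sets from different replications share exactly one block (n/4 samples).
    pairs = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    partitions = []
    for (i, j), (k, l) in pairs:
        train = np.concatenate([b[i], b[j]])
        valid = np.concatenate([b[k], b[l]])
        # In 2-fold CV each half serves once as training and once as
        # validation set, yielding two hold-out estimates per replication.
        partitions.append((train, valid))
    return partitions

parts = blocked_3x2_partitions(100)
t1, t2 = parts[0][0], parts[1][0]
print(len(np.intersect1d(t1, t2)))  # 25 == n/4, as the condition requires
```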
For a text data set, chi-squared statistics are used to measure the differences of multiple frequency distributions between the training and validation sets, and several regularization conditions based on these statistics are introduced to further optimize m×2 BCV. Extensive experiments on typical IID data sets and text data sets demonstrate the effectiveness of the developed BCV methods.

This study formulates algorithm comparison as a hypothesis testing problem and investigates statistical inference based on m×2 BCV. Because the training sets of m×2 BCV share overlapping samples, the hold-out estimators of the generalization error in m×2 BCV are markedly correlated, which distinguishes inference on m×2 BCV from conventional inference on IID observations. The upper and lower bounds of the correlation coefficients in an m×2 BCV estimator are investigated through extensive theoretical analysis and simulations. Furthermore, a sound variance estimator of the m×2 BCV estimator is proposed, and a t-statistic is developed. With a reasonable setting of the correlation coefficients, a conservative t-test statistic is constructed, and sequential confidence intervals (CIs) are given. Unlike conventional sequential tests on IID observations, as the repetition count m tends to infinity, the expected length of the sequential CIs converges to a positive constant, so a stopping time of the test may not exist in some situations. Thus, the reduction rate of the expected length of the sequential CIs is used as a criterion to select the maximum stopping time of the sequential t-test. Theoretical and simulated comparisons of several existing tests with the proposed sequential t-test show that the sequential t-test controls the type I error, possesses higher power, and yields reliable conclusions. The comparison also demonstrates the necessity of conducting a sequential test rather than tests with a fixed m.
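The dissertation's conservative t-test rests on a variance estimator derived from bounds on the correlations among the 2m hold-out estimates; that estimator is not reproduced here. As a runnable stand-in, the sketch below implements the classical 5×2 CV paired t-test of Dietterich (1998), generalized to m replications, applied to the per-fold performance differences between the two algorithms.

```python
import numpy as np
from scipy import stats

def m_by_2_cv_t_test(diffs):
    """Paired t-test over m x 2 cross-validated performance differences.
    `diffs` has shape (m, 2): diffs[i, j] is the difference between the
    two algorithms' hold-out errors on fold j of replication i.
    Follows Dietterich's 5x2 cv paired t-test generalized to m
    replications; the dissertation's conservative test replaces the
    variance estimator with one built from correlation bounds."""
    diffs = np.asarray(diffs, dtype=float)
    m = diffs.shape[0]
    fold_means = diffs.mean(axis=1, keepdims=True)
    s2 = ((diffs - fold_means) ** 2).sum(axis=1)  # per-replication variance
    t = diffs[0, 0] / np.sqrt(s2.mean())          # Dietterich's statistic
    p = 2 * stats.t.sf(abs(t), df=m)              # two-sided p-value, m dof
    return t, p

# Toy usage: simulated differences from m = 5 replications of 2-fold CV.
rng = np.random.default_rng(1)
t, p = m_by_2_cv_t_test(rng.normal(0.01, 0.02, size=(5, 2)))
print(f"t = {t:.3f}, p = {p:.3f}")
```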
Moreover, for a text data set, precision, recall, and the F1 measure are commonly used evaluation metrics, and their distributions are skewed, so the sequential t-test is not suitable for them. Therefore, the relationship between the posterior distributions of the three metrics and the correlation coefficients in m×2 BCV estimators is elaborated; accurate posterior distributions and reasonable CIs for precision, recall, and the F1 measure are provided; and a Bayes test on the posterior distributions is proposed for algorithm comparison (a minimal sketch appears at the end of this abstract). Experiments on Chinese word segmentation and named entity recognition tasks illustrate the validity of the Bayes test.

This study takes software defect prediction as an example application of m×2 BCV. For a defect count prediction model, the m×2 BCV sequential t-test is applied to test whether aggregated features significantly improve model performance. For the defect-proneness prediction task, the Bayes test based on m×2 BCV is applied to test the significance of the difference between logistic regression and random forest in terms of precision, recall, and the F1 measure.

In summary, this study proposes BCV and the corresponding statistical inference methods to guarantee the reliability of algorithm comparison conclusions, which is of great significance to the modeling process of supervised learning methods. The regularization methods for the optimal design of data partitioning can also be used for sub-sampling on massive data sets to guide the modeling process of distributed learning algorithms.
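To make the Bayes test idea concrete, the minimal sketch below compares two algorithms' precision through simple Beta posteriors on pooled validation counts. It deliberately ignores the m×2 BCV correlation structure that the dissertation's posteriors account for, so it illustrates the flavor of a posterior-based comparison rather than the proposed test itself; the function name and counts are hypothetical.

```python
import numpy as np

def posterior_prob_better(tp_a, fp_a, tp_b, fp_b, n_draws=100_000, seed=0):
    """Posterior probability that algorithm A has higher precision than B.
    Uses independent Beta(tp+1, fp+1) posteriors (uniform prior) for each
    algorithm's precision and compares them by Monte Carlo sampling.
    The dissertation's Bayes test instead derives posteriors that reflect
    the correlations among m x 2 BCV hold-out estimates; this sketch
    ignores that structure and is for illustration only."""
    rng = np.random.default_rng(seed)
    prec_a = rng.beta(tp_a + 1, fp_a + 1, n_draws)
    prec_b = rng.beta(tp_b + 1, fp_b + 1, n_draws)
    return (prec_a > prec_b).mean()

# Toy usage: true/false positives pooled over the m x 2 validation sets.
print(posterior_prob_better(tp_a=420, fp_a=60, tp_b=395, fp_b=85))
```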
Keywords/Search Tags:Regularization, Cross validation, Data partitioning, Algorithm comparison, Statistical inference, Software defect prediction