
Measurement And Analysis Of The Consistency Of Training And Test Data Distribution

Posted on: 2022-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: R Y Chu
Full Text: PDF
GTID: 2507306509469764
Subject: Statistics
Abstract/Summary:
In statistical machine learning, the data set is usually divided into training data and test data in order to evaluate model performance: the training data are used to construct the prediction model, while the test data are used to evaluate the performance of the constructed model. This evaluation procedure rests on a basic assumption that the training and test data follow the same distribution, yet in practical applications this assumption is often ignored. In real tasks such as spam filtering, clinical trial research, and signal transmission in bioinformatics, the distributions of the training and test data are often inconsistent. If this inconsistency is ignored, it may lead to biased model selection and a degradation of the model's generalization performance. Measuring and analyzing the consistency of the training and test data distributions is therefore crucial in statistical machine learning research.

Commonly used methods for measuring distribution consistency include methods based on sample weighting, methods based on hypothesis testing, and methods based on various distance functions. Among these, the measure based on the KL (Kullback-Leibler) distance has attracted particular attention from researchers because of its simplicity and effectiveness. However, the range of the KL distance is [0,+∞), so directly measuring the difference between the training and test distributions with the KL distance may not provide a suitable general criterion, because different data sets can yield very different KL distances between their training and test data. Motivated by this, this thesis normalizes the KL distance between the training and test data of different data sets into the range [0,1] by means of a logit transformation, thereby providing a suitable general criterion for measuring the consistency of the training and test data distributions. It is then proved theoretically that the proposed metric is sign-preserving: as the sample size tends to infinity, the conclusion about distribution consistency drawn from the general metric on finite samples remains valid. The rationality and effectiveness of the proposed general criterion are verified by extensive simulation and real-data experiments.

Furthermore, a new distribution-weighting method is proposed to correct the distributional difference between the training and test data, so that the corrected training and test distributions are consistent and the decline in model performance caused by the inconsistency is avoided. In addition, the KL distance metric is applied to choosing the number of folds in K-fold cross-validation, and a selection criterion for the number of folds based on a regularization of the KL distance is proposed. The validity of this criterion is verified by several real-data experiments.
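To make the core idea concrete, the following is a minimal sketch (not the thesis's exact estimator or transformation) of measuring the consistency of training and test feature distributions with a KL distance and mapping it into [0,1]. The univariate Gaussian KL estimate and the d/(1+d) normalization are illustrative assumptions standing in for the normalization described above.

```python
# Minimal sketch: estimate the KL distance between training and test feature
# distributions under a univariate Gaussian assumption, then map it from
# [0, +inf) into [0, 1) with a simple bounded transform so that values from
# different data sets become comparable. The transform d / (1 + d) is an
# illustrative stand-in for the normalization used in the thesis.
import numpy as np

def gaussian_kl(x_train, x_test):
    """KL(P_train || P_test), assuming both samples are roughly Gaussian."""
    mu_p, var_p = np.mean(x_train), np.var(x_train) + 1e-12
    mu_q, var_q = np.mean(x_test), np.var(x_test) + 1e-12
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def normalized_consistency(x_train, x_test):
    """Map the unbounded KL distance into [0, 1); 0 means identical distributions."""
    d = gaussian_kl(x_train, x_test)
    return d / (1.0 + d)

rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, size=2000)   # training feature
x_te = rng.normal(0.5, 1.2, size=1000)   # shifted test feature
print("KL distance:         ", round(gaussian_kl(x_tr, x_te), 4))
print("normalized in [0,1): ", round(normalized_consistency(x_tr, x_te), 4))
```

Because the normalized value always lies in [0,1), a single threshold can be applied across data sets, which is the practical benefit of a general criterion over the raw KL distance.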
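The distribution-weighting correction can likewise be sketched as importance weighting: a probabilistic classifier distinguishes training from test inputs, and the implied density ratio p_test(x)/p_train(x) is used as a per-sample weight when fitting the prediction model. This is a generic covariate-shift correction under stated assumptions, not necessarily the specific weighting scheme proposed in the thesis.

```python
# Minimal sketch of distribution-weighted correction (importance weighting):
# reweight the training data so that its weighted distribution more closely
# matches the test distribution before fitting the prediction model.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(2000, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0.0, 0.5, 2000)
X_test = rng.normal(0.8, 1.0, size=(1000, 1))        # covariate shift

# 1. Classify "does this point come from the test set?" to estimate the
#    density ratio p_test(x) / p_train(x) on the training points.
X_all = np.vstack([X_train, X_test])
z = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
clf = LogisticRegression().fit(X_all, z)
p_test_given_x = clf.predict_proba(X_train)[:, 1]
weights = (p_test_given_x / (1.0 - p_test_given_x)) * (len(X_train) / len(X_test))

# 2. Fit the prediction model on the reweighted training data.
model = Ridge().fit(X_train, y_train, sample_weight=weights)
```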
Keywords/Search Tags: KL (Kullback-Leibler) distance, Distribution consistency, Metrics, K-fold cross-validation, Selection of the number of folds K