
Measurement And Analysis Of The Consistency Of Training And Test Data Distribution

Posted on: 2022-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: R Y Chu
Full Text: PDF
GTID: 2507306509469764
Subject: Statistics
Abstract/Summary:
In statistical machine learning, the data set is usually divided into training data and test data in order to evaluate model performance: the training data are used to construct the prediction model, while the test data are used to evaluate the performance of the constructed model. This evaluation procedure rests on a basic assumption that the training and test data follow the same distribution, yet in practical applications this assumption is often ignored. In real tasks such as spam filtering, clinical trial research, and signal transmission in bioinformatics, the distributions of the training and test data are often inconsistent. If this inconsistency is ignored, it may lead to biased model selection and a degradation of the model's generalization performance. Measuring and analyzing the consistency of the training and test data distributions is therefore crucial in statistical machine learning research.

Commonly used methods for measuring distribution consistency include methods based on sample weighting, methods based on hypothesis testing, and methods based on various distance functions. Among these, the measure based on the KL (Kullback-Leibler) distance has attracted particular attention from researchers because of its simplicity and effectiveness. However, the range of the KL distance is [0,+∞), so directly measuring the difference between the training and test distributions with the KL distance may not provide a suitable general criterion, because different data sets can yield very different KL distances between their training and test data. Motivated by this, this thesis normalizes the KL distance between the training and test data of different data sets into the range [0,1] by means of a logit transformation, thereby providing a suitable general criterion for measuring the consistency of the training and test data distributions. It is then proved theoretically that the proposed metric is sign-preserving: as the sample size tends to infinity, the conclusion about distribution consistency drawn from the general metric on finite samples remains valid. The rationality and effectiveness of the proposed general criterion are verified by extensive simulation and real-data experiments.

Furthermore, a new distribution-weighting method is proposed to correct the distributional difference between the training and test data, so that the corrected training and test distributions are consistent and the decline in model performance caused by the inconsistency is avoided. In addition, the KL distance metric is applied to choosing the number of folds in K-fold cross-validation, and a selection criterion for the number of folds based on a regularization of the KL distance is proposed. The validity of this criterion is verified by several real-data experiments.
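To make the core idea concrete, the following is a minimal sketch (not the thesis's exact estimator or transformation) of measuring the consistency of training and test feature distributions with a KL distance and mapping it into [0,1]. The univariate Gaussian KL estimate and the d/(1+d) normalization are illustrative assumptions standing in for the normalization described above.

```python
# Minimal sketch: estimate the KL distance between training and test feature
# distributions under a univariate Gaussian assumption, then map it from
# [0, +inf) into [0, 1) with a simple bounded transform so that values from
# different data sets become comparable. The transform d / (1 + d) is an
# illustrative stand-in for the normalization used in the thesis.
import numpy as np

def gaussian_kl(x_train, x_test):
    """KL(P_train || P_test), assuming both samples are roughly Gaussian."""
    mu_p, var_p = np.mean(x_train), np.var(x_train) + 1e-12
    mu_q, var_q = np.mean(x_test), np.var(x_test) + 1e-12
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def normalized_consistency(x_train, x_test):
    """Map the unbounded KL distance into [0, 1); 0 means identical distributions."""
    d = gaussian_kl(x_train, x_test)
    return d / (1.0 + d)

rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, size=2000)   # training feature
x_te = rng.normal(0.5, 1.2, size=1000)   # shifted test feature
print("KL distance:         ", round(gaussian_kl(x_tr, x_te), 4))
print("normalized in [0,1): ", round(normalized_consistency(x_tr, x_te), 4))
```

Because the normalized value always lies in [0,1), a single threshold can be applied across data sets, which is the practical benefit of a general criterion over the raw KL distance.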
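The distribution-weighting correction can likewise be sketched as importance weighting: a probabilistic classifier distinguishes training from test inputs, and the implied density ratio p_test(x)/p_train(x) is used as a per-sample weight when fitting the prediction model. This is a generic covariate-shift correction under stated assumptions, not necessarily the specific weighting scheme proposed in the thesis.

```python
# Minimal sketch of distribution-weighted correction (importance weighting):
# reweight the training data so that its weighted distribution more closely
# matches the test distribution before fitting the prediction model.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(2000, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0.0, 0.5, 2000)
X_test = rng.normal(0.8, 1.0, size=(1000, 1))        # covariate shift

# 1. Classify "does this point come from the test set?" to estimate the
#    density ratio p_test(x) / p_train(x) on the training points.
X_all = np.vstack([X_train, X_test])
z = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
clf = LogisticRegression().fit(X_all, z)
p_test_given_x = clf.predict_proba(X_train)[:, 1]
weights = (p_test_given_x / (1.0 - p_test_given_x)) * (len(X_train) / len(X_test))

# 2. Fit the prediction model on the reweighted training data.
model = Ridge().fit(X_train, y_train, sample_weight=weights)
```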
Keywords/Search Tags: KL (Kullback-Leibler) distance, Distribution consistency, Metrics, K-fold cross-validation, Selection of the number of folds K