Impact Analysis Of Classification Performance For Data Distribution In Cross-Validation

Posted on: 2014-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: C X Zhao
GTID: 2268330401962913
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Cross-validation is commonly used to estimate the performance of a wide range of machine learning models. Existing research has revealed some properties of the cross-validation estimate of the expected prediction error and has proposed improvements to it. For example, repeated cross-validation reduces the variance of the estimate, and stratified cross-validation reduces the deviation of the original cross-validation estimate. However, this work aims only at a better estimate of the expected prediction error and does not address other performance indicators of classification models, such as accuracy, recall, F-value, ROC, and AUC.
This thesis focuses on how the standard 2-fold cross-validation estimate changes with the class distribution or the design matrix distribution, using four measures of model performance: precision, recall, F-value, and accuracy. It studies binary classification problems in which the classifier is a logistic regression model and the design matrix is restricted to a 0-1 matrix. The work is based on randomly generated simulation data and a large number of simulation experiments.
The experimental results show the following. (1) Class distribution of the sample: when the two folds of 2-fold cross-validation have the same or similar class distributions, the 2-fold cross-validation estimates of precision, recall, F-value, and accuracy have the smallest deviation, and the deviation grows as the class distributions of the two folds become more imbalanced. When the two folds differ greatly in their data distribution, model performance is significantly worse.
Therefore, when splitting data for cross-validation, the class distribution of each fold should be kept consistent with that of the overall data set. (2) Distribution of the design matrix: when the two folds of 2-fold cross-validation have the same or similar class distributions but different design matrix distributions, the deviation of the estimates increases with the difference between the folds. When splitting the data, one should therefore keep the class distribution of each fold consistent with the overall data set and also try to keep the design matrix distribution consistent across folds. (3) Researchers hold that, when splitting a data set, the design matrix distribution should be kept as consistent as possible in addition to the class distribution. However, when the design matrix is a 0-1 matrix, and especially when its dimension is high, it is difficult to find a good measure of the consistency of the design matrix distribution. This thesis attempts to use the KL distance as such a measure, but the measure fails for high-dimensional feature matrices.
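The quantities named in the abstract can be illustrated with a short sketch. It is not the thesis's experimental code: the function names, the toy labels, and the use of the KL divergence on class proportions (rather than on the design matrix distribution) are assumptions made for illustration, and no logistic regression model is fitted here. The sketch computes the four performance measures, builds a class-balanced 2-fold split, and compares the class distributions of two folds.

```python
import math
import random
from collections import Counter


def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F-value (F1) and accuracy for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_value": f_value, "accuracy": accuracy}


def stratified_two_fold(labels, seed=0):
    """Split indices into two folds whose class proportions match the whole set."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    fold_a, fold_b = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        half = len(indices) // 2
        fold_a.extend(indices[:half])
        fold_b.extend(indices[half:])
    return sorted(fold_a), sorted(fold_b)


def class_distribution(labels, classes):
    """Relative frequency of each class, in the order given by `classes`."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps avoids log(0) and division by 0."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Toy data (hypothetical): 20 positive and 80 negative examples.
labels = [1] * 20 + [0] * 80
classes = [0, 1]

# Stratified split: both folds keep the overall 80/20 class ratio.
fold_a, fold_b = stratified_two_fold(labels)
p_a = class_distribution([labels[i] for i in fold_a], classes)
p_b = class_distribution([labels[i] for i in fold_b], classes)
print("stratified KL:", kl_divergence(p_a, p_b))

# Naive split by position: the second fold contains no positives at all.
q_a = class_distribution(labels[:50], classes)
q_b = class_distribution(labels[50:], classes)
print("unstratified KL:", kl_divergence(q_a, q_b))
```

With a stratified split the two class distributions coincide and the divergence is zero, while the positional split yields a large value. This covers only the class-distribution case from finding (1); the abstract reports that a KL-based measure of the design matrix distribution breaks down in high dimensions, which this sketch does not attempt to reproduce.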
Keywords/Search Tags: 2-fold cross-validation, uneven class distribution, logistic regression classification model, model performance