Impact Analysis Of Classification Performance For Data Distribution In Cross-Validation

Posted on:2014-07-19

Degree:Master

Type:Thesis

Country:China

Candidate:C X Zhao

GTID:2268330401962913

Subject:Probability theory and mathematical statistics

Abstract/Summary:

Cross-validation is commonly used to estimate the performance of a wide range of machine learning models. Existing research has revealed some properties of cross-validation estimates of the expected prediction error and has proposed improvements to them. For example, repeating cross-validation reduces the variance of the estimate, and stratified cross-validation reduces the deviation of the original cross-validation estimate. However, this work aims only at a better estimate of the expected prediction error and does not address other performance indicators of classification models, such as accuracy, recall, F-value, ROC and AUC.

This thesis studies how the standard 2-fold cross-validation estimate changes with the class distribution and with the design matrix distribution, using four measures of model performance: precision, recall, F-value and accuracy. It considers binary classification problems in which the classifier is a logistic regression model and the design matrix is a 0-1 matrix. The study is based on randomly generated simulation data and a large number of simulation experiments. The experimental results show the following.

(1) Class distribution of the sample: when the two folds of the 2-fold cross-validation have the same or similar class distributions, the 2-fold cross-validation estimates of precision, recall, F-value and accuracy have the smallest deviation, and the deviation grows as the class imbalance between the two folds increases. When the class distributions of the two folds differ greatly, model performance is significantly worse. Therefore, when splitting data for cross-validation, the class distribution of each fold should be kept consistent with that of the whole sample.

(2) Distribution of the design matrix: when the two folds have the same or similar class distributions but different design matrix distributions, the deviation of the estimates increases with the difference between the folds. Therefore, when splitting data, the class distribution of each fold should be kept consistent with the whole sample, and the design matrix distribution should also be kept as consistent as possible across the folds.

(3) Although researchers hold that a split should keep the design matrix distribution as consistent as possible in addition to keeping the class distribution consistent, it is difficult to find a good measure of the consistency of the design matrix distribution when the design matrix is a 0-1 matrix, especially when its dimension is high. This thesis attempts to build such a measure from the KL distance, but the measure fails for high-dimensional feature matrices.
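
The recommendation in (1) can be illustrated with a minimal sketch. It is not the thesis's experimental code: the function names, the toy labels, the smoothing constant, and the use of KL divergence on the class proportions of the two folds (rather than on the design matrix) are all assumptions made for illustration. The sketch splits a sample into two folds that preserve the overall class proportions, compares that split with a deliberately skewed one, and computes the four performance measures named in the abstract from true and predicted labels.

```python
import math
import random
from collections import Counter


def stratified_two_fold(labels, seed=0):
    """Split indices into two folds, halving each class separately
    so both folds keep (roughly) the overall class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    fold_a, fold_b = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        half = len(indices) // 2
        fold_a.extend(indices[:half])
        fold_b.extend(indices[half:])
    return sorted(fold_a), sorted(fold_b)


def class_distribution(labels, indices, classes=(0, 1), smoothing=1e-9):
    """Empirical class proportions of a fold. The small smoothing term
    (an assumption of this sketch) avoids division by zero in the KL term."""
    counts = Counter(labels[i] for i in indices)
    total = len(indices) + smoothing * len(classes)
    return [(counts.get(c, 0) + smoothing) / total for c in classes]


def kl_divergence(p, q):
    """KL divergence sum_i p_i * log(p_i / q_i) between discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def binary_metrics(y_true, y_pred):
    """Precision, recall, F-value and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_value": f_value, "accuracy": accuracy}


# Toy sample (hypothetical): 30 positives followed by 70 negatives.
labels = [1] * 30 + [0] * 70

# Stratified split: both folds keep the 30/70 class ratio.
fold_a, fold_b = stratified_two_fold(labels, seed=1)
kl_stratified = kl_divergence(class_distribution(labels, fold_a),
                              class_distribution(labels, fold_b))

# Skewed split: the first 50 indices hold every positive example.
skew_a, skew_b = list(range(50)), list(range(50, 100))
kl_skewed = kl_divergence(class_distribution(labels, skew_a),
                          class_distribution(labels, skew_b))

print("KL between folds, stratified split:", kl_stratified)
print("KL between folds, skewed split:", kl_skewed)
```

With these toy labels the stratified folds have identical class proportions, so the divergence between them is zero, while the contiguous split puts every positive example in one fold and yields a large positive divergence. In a full experiment, a logistic regression model would be trained on one fold and scored on the other with `binary_metrics`, then the roles of the folds swapped.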

Keywords/Search Tags:

2-fold cross-validation, uneven class distribution, logistic regression classification model, model performance

Related items

1	Research And Application Of Recommendation Technology Based On Logistic Regression
2	Research On The Prediction Of Insurance Payment Based On Logistic Regression Model
3	Rural Credit Cooperatives Credit Risk Evaluation System Based On Logistic Regression Model
4	The Study Of A Prediction Method For Search Ad CTR Based On Logistic Regression Model
5	Research On The Algorithm Of Multi-instance Learning Based On Logistic Regression Model
6	Inductive Decision Tree Classification Model In The Military Transport Vehicle Management System
7	The Research Of PKI Cross-Domain Bridge Trust Model Based On Validation Agent
8	Robust Low Rank Matrix Recovery And Application Of Sparse Logistic Regression Model
9	Research On Prediction Model Of Repoverty Based On Logistic Regression Analysis
10	An Evaluation Method Of Microblog User's Reliability Based On Logistic Regression Model