
A Performance Comparison Between Logistic Regression, Decision Trees, And Neural Networks In Predicting Peripheral Neuropathy In Type 2 Diabetes Mellitus

Posted on: 2010-03-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C P Li
Full Text: PDF
GTID: 1114360275962276
Subject: Epidemiology and Health Statistics

Abstract/Summary:
The development of mathematical methods and computer technology has in recent years made it feasible to use complex models for prediction. Two families of methods are in common use: statistical methods and data mining methods. Prediction techniques based on both have been applied in biomedical research, but comparatively few studies compare their predictive performance, that is, their generalization ability; comparing the generalization ability of data mining methods with that of statistical methods is therefore worthwhile. Taking as an example the case-control data on Diabetic Peripheral Neuropathy (DPN) in type 2 Diabetes Mellitus introduced in Chapter 2 of this thesis, this study proposes solutions to several difficulties that arise in building logistic regression, decision tree, and neural network models and in comparing their performance in predicting the probability of DPN. The difficulties and the corresponding solutions are as follows.

(1) Discretization of continuous variables. In some studies the effect of a one-unit change in a continuous variable is not of interest, or discretization is required on subject-matter grounds; how to discretize continuous variables in a principled way is then a question worth studying. In this thesis we discretize continuous variables with a chi-square partitioning method, choosing cut points that maximize the distinction between classes.

(2) Making full use of the data while avoiding overfitting during model building. When the amount of data is limited, it is particularly important to use as much of the information in the data as possible while avoiding overfitting; this matters most for the decision tree and the neural network. In this research we combine the classification and regression tree (CART) with the chi-squared automatic interaction detector (CHAID) tree, using 100 repetitions of 5- to 7-fold stratified cross-validation, to build a decision tree model that uses the data fully without overfitting. For the neural network, we use the Schwarz Bayesian Criterion to choose the number of hidden layers and hidden units and train with the Levenberg-Marquardt algorithm, weight decay, and preliminary training, obtaining a reliable model that makes full use of the data while avoiding overfitting and poor local minima. A sketch of the repeated stratified cross-validation scheme is given below.
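No code accompanies this abstract; the following Python sketch merely illustrates the kind of repeated stratified cross-validation described in point (2), with scikit-learn's RepeatedStratifiedKFold and a plain decision tree standing in for the CART/CHAID combination actually used in the thesis. The feature matrix X, outcome vector y, the tree depth, and the number of repetitions are illustrative assumptions.

```python
# Illustrative sketch of 100 x 5-fold stratified cross-validation for a
# decision tree classifier; feature matrix X and binary outcome y (DPN yes/no)
# are assumed to be available as NumPy arrays.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def repeated_cv_auc(X, y, n_splits=5, n_repeats=100, seed=0):
    """Return the mean and standard deviation of the AUC over repeated folds."""
    rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                   random_state=seed)
    aucs = []
    for train_idx, test_idx in rskf.split(X, y):
        model = DecisionTreeClassifier(max_depth=4, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], prob))
    return np.mean(aucs), np.std(aucs)
```

Because every observation serves as a test case in each repetition, the limited data are used in full while the performance estimate remains out-of-sample.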
(3) Quick and efficient construction of a logistic regression model. Conventional variable-screening methods for logistic regression include forward entry, backward elimination, stepwise selection, and the best subset method. The first three depend on a P-value cutoff for entering and (or) removing variables, a cutoff that is chosen subjectively; in some cases an entry threshold (SLENTRY) of 0.05 is too stringent and excludes important variables from the model. The best subset method reports a chi-square value for every combination of variables but does not decide which combination is optimal. Selecting variables quickly and effectively is therefore essential for building an accurate and reliable model. In this thesis we combine the best subset method with the Akaike Information Criterion to screen variables quickly and easily. The approach takes the generalization ability of the model into account and avoids the trouble of choosing a P-value cutoff by hand; the resulting logistic regression model is superior to those obtained with the conventional screening methods.

(4) Comparing generalization ability with a small sample. A large body of literature shows that studies of prediction and classification models in biomedicine have either used large data sets (from several hundred to several hundred thousand observations) or used the holdout method (one part of the data for training and the remainder for testing) to assess a model's generalization ability. These studies do not address how to make full use of the data or how to compare generalization ability when the sample is small (on the order of one hundred observations). In practice, however, the data set is often small while the number of variables is large. If the holdout method is used to evaluate generalization ability in this setting, information is lost and the results have low confidence or are even unreliable (as shown in Chapter 5 of this thesis). How to build an effective model and evaluate its generalization ability objectively with a small sample is therefore worth studying, and it is one of the primary focuses of this work. We adopt Monte Carlo resampling-based validation (2- to 10-fold stratified cross-validation, the jackknife method, and the bootstrap with 100 to 1000 replications, specifically the 0.632 bootstrap) to obtain a reliable estimate of the generalization error and to compare the three models objectively, avoiding the drawbacks of the holdout method; a sketch of the 0.632 bootstrap estimate is given below. On the whole, for the DPN data the neural network shows the strongest generalization ability, followed by the logistic regression and then the decision tree.
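The thesis itself reports no code; the sketch below shows one common form of the 0.632 bootstrap error estimate mentioned in point (4), using misclassification error as the loss and a generic scikit-learn classifier. The function name, the number of replications, and the logistic regression used in the example are illustrative assumptions, not the models actually compared in the study.

```python
# Illustrative 0.632 bootstrap estimate of prediction error:
# err_632 = 0.368 * apparent (resubstitution) error + 0.632 * out-of-bag error.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def bootstrap_632_error(model, X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    # Apparent error: fit and evaluate on the full sample.
    full_fit = clone(model).fit(X, y)
    err_app = np.mean(full_fit.predict(X) != y)
    oob_errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)  # observations left out of this sample
        if oob.size == 0:
            continue
        fit = clone(model).fit(X[idx], y[idx])
        oob_errors.append(np.mean(fit.predict(X[oob]) != y[oob]))
    err_oob = np.mean(oob_errors)
    return 0.368 * err_app + 0.632 * err_oob

# Example: estimate the error of a logistic regression on arrays X, y.
# err = bootstrap_632_error(LogisticRegression(max_iter=1000), X, y, n_boot=500)
```

The 0.632 weighting balances the optimistic apparent error against the pessimistic out-of-bag error, which is why it is attractive when the sample is too small for a holdout split.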
(5) Adjustment for oversampling. When the data are obtained by oversampling (that is, separate sampling by outcome status), the probabilities estimated by a model reflect the sample rather than the population, and predictions of disease probability for the overall population can therefore be substantially biased. Because our data come from oversampling, we use the prior (population) probability to adjust the posterior probabilities so that the adjusted predictions describe the risk of disease more objectively and accurately; a minimal sketch of this adjustment is given below.
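The exact adjustment formula is not stated in this abstract; the sketch below uses the standard prior-probability correction for separately sampled (case-control) data, in which the predicted odds are rescaled by the ratio of population odds to sample odds. The function name and the prevalence values in the example are hypothetical.

```python
# Illustrative prior-probability adjustment of posterior probabilities for
# oversampled (case-control) data: rescale the predicted odds by the ratio of
# population odds to sample odds. The prevalence values below are placeholders,
# not figures from the thesis.
import numpy as np

def adjust_for_oversampling(p_sample, sample_prev, population_prev):
    """Convert probabilities estimated on an oversampled data set to the
    probability scale of the target population."""
    odds = p_sample / (1.0 - p_sample)
    correction = (population_prev / (1.0 - population_prev)) / (
        sample_prev / (1.0 - sample_prev))
    adj_odds = odds * correction
    return adj_odds / (1.0 + adj_odds)

# Example: predictions from a 1:1 case-control sample (sample prevalence 0.5)
# adjusted to an assumed population prevalence of 0.08.
p_hat = np.array([0.2, 0.5, 0.8])
print(adjust_for_oversampling(p_hat, sample_prev=0.5, population_prev=0.08))
```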
In summary, we use three methods (logistic regression, decision tree, and neural network) to predict the probability of DPN. For the small-sample setting we have made comparative studies and improvements in five respects: ① principled discretization of continuous variables, ② full use of the data information while avoiding overfitting, ③ quick and efficient model building, ④ efficient use of the data information and improvement of the generalization ability, and ⑤ adjustment for oversampling to obtain more objective and accurate predictions; the desired results have been achieved. The modeling ideas and techniques can be applied conveniently and successfully in biomedical research and in other fields.

Keywords/Search Tags: logistic regression, decision tree, neural network, holdout method, cross-validation method, jackknife method, bootstrap method, oversampling, Diabetic Peripheral Neuropathy