
Study On The Strategy Of Classification Methods In Data Mining And Their Applications In Biomedicine

Posted on: 2009-11-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Zhang
Full Text: PDF
GTID: 1114360272962178
Subject: Epidemiology and Health Statistics

Abstract/Summary:
Background: Data mining is a technology that extracts knowledge and information from large volumes of data by combining statistics, database technology, and artificial intelligence. It has been reported that in biomedicine, less than 10 percent of collected data are ever used for analysis. At the same time, data mining methods for biomedical data remain relatively scarce and underdeveloped, which makes this a promising and active research field, and one that demands further study as biomedical data grow dramatically. Classification methods are widely used in the biomedical field: image classification in medical imaging, diagnostic support in pathology, clinical laboratory work, Chinese medicine, gene chip and microarray data in molecular biology, as well as life insurance, and so on. Therefore, studying how to correctly apply classification methods to extract valuable knowledge and information from large amounts of data has practical significance and broad application.

Objective: To explore a strategy for classification analysis in data mining, based on comparisons of eight classification methods using the Monte Carlo technique, and to make that strategy reasonable and practical for data mining applications.

Methods: The classification methods studied were linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), the K-nearest neighbor method (KNN), logistic regression, the chi-square automatic interaction detector (CHAID), C4.5, classification and regression trees (CART), and the back propagation neural network (BPNN).
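The comparison design can be sketched in code. The study itself used SAS; the following is a minimal Python/scikit-learn analogue for a representative subset of the eight methods (CHAID and C4.5 have no scikit-learn implementation, and the CART-style tree stands in for the tree family), fitted to simulated two-group data and scored by misclassification rate and AUC.

```python
# Hedged Python/scikit-learn analogue of the eight-method comparison
# (the dissertation used SAS 9.1.3; this is an illustrative sketch only).
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier        # CART-style tree
from sklearn.neural_network import MLPClassifier       # BP neural network
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 400
# Three predictors, pairwise correlation 0.3, effect size 0.8 on the means.
cov = np.full((3, 3), 0.3) + 0.7 * np.eye(3)
X = np.vstack([rng.multivariate_normal(np.zeros(3), cov, n // 2),
               rng.multivariate_normal(np.full(3, 0.8), cov, n // 2)])
y = np.repeat([0, 1], n // 2)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "Logistic": LogisticRegression(),
    "Tree (CART)": DecisionTreeClassifier(max_depth=3, random_state=0),
    "BPNN": MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000,
                          random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    p = model.predict_proba(X)[:, 1]
    err = np.mean(model.predict(X) != y)   # overall misclassification rate
    print(f"{name}: error={err:.3f}, AUC={roc_auc_score(y, p):.3f}")
```

In the actual study each cell of the design would repeat this with fresh training and test samples 1000 times; the loop above shows only a single replicate.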
The target variable was binary and the number of predictor variables was set to 3. The methods were compared in a Monte Carlo simulation study; all simulations and analyses were conducted with SAS 9.1.3. To assess the performance of the eight procedures under a variety of conditions, several factors were manipulated: the distribution of the predictor variables, sample size, level of covariance matrix heterogeneity between the groups, proportion of cases in each group, effect size separating the groups, multicollinearity among the predictor variables, and the prior probability. The distribution of the predictor variables had four levels: (a) all normal, (b) all nonnormal with skewness of 2.0 and kurtosis of 7.0, (c) mixed normal and dichotomous categorical, and (d) all dichotomous categorical. The correlation between the three predictors was set at 0.3 for all simulations except the multicollinearity conditions. The multivariate normal data were generated with SAS IML, which allows the user to specify both the covariance structure and the means of the variables. The skewed data were created by first generating standard normal variables, which were then transformed using an established methodology for creating skewed, correlated data. The dichotomous predictors were created from random uniform variables generated by the UNIFORM function of SAS: for each simulee, the dichotomous variable was assigned "1" if the uniform value exceeded 0.5 and "0" otherwise. Three sample sizes were used: 60, 100, and 400 subjects. The ratio of subjects between the groups was 50:50, 25:75, or 10:90. The between-group covariance matrices were simulated to be either homogeneous or heterogeneous at a 4:1 or 8:1 ratio, and effect sizes were incorporated by setting the mean of one group at 0 and the mean of the other at 0.2, 0.5, or 0.8.
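The data-generation steps above can be sketched as follows, assuming NumPy in place of SAS IML and the SAS UNIFORM function. The skewed-data transformation (a polynomial transform of correlated standard normals, e.g. a Fleishman-type method) is noted but not reproduced, since its coefficients are not given in the abstract.

```python
# Sketch of the predictor-generation step (NumPy stand-in for SAS IML).
import numpy as np

rng = np.random.default_rng(42)
n = 100
corr = 0.3
# 3x3 covariance matrix: unit variances, pairwise correlation 0.3.
cov = np.full((3, 3), corr) + (1 - corr) * np.eye(3)

# (a) Multivariate normal predictors with a specified covariance structure.
X_normal = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=n)

# (d) Dichotomous predictors: draw uniforms and assign 1 if u > 0.5, else 0,
# mirroring the UNIFORM-based generation described in the study.
u = rng.uniform(size=(n, 3))
X_binary = (u > 0.5).astype(int)

# (b) Skewed predictors (skewness 2.0, kurtosis 7.0) were produced in the
# study by transforming correlated standard normals; the exact transform
# coefficients are not reproduced here.
```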
For the multicollinearity conditions, the correlation among the predictors was raised from the baseline 0.3 to 0.6 and 0.9. The prior probability was set proportional to group size for all preliminary simulations, and then set at 0.5:0.5 and 0.25:0.75 as a contrast. Note that for all the categorical conditions, only the equal-covariance condition was used, to maintain the appropriate effect size: for dichotomous variables the variance is confounded with the proportion of subjects in each group (it is a function of that proportion), so the impact of unequal group variances could not be examined. Performance of the eight methods was assessed with the overall misclassification rate, sensitivity, specificity, and the area under the ROC curve (AUC). Within each cell of each data-structure pattern, two random samples were generated: the first served as training data and the second as test data. For each combination of conditions, 1000 simulations were run.

Results

1 Distribution of the predictor variables

(a) Under a multivariate normal distribution with proportional prior probability and equal covariance matrices, LDA had the smallest overall misclassification rate and the largest AUC of the eight methods, followed by logistic regression; the parametric methods outperformed the nonparametric ones. Under a multivariate normal distribution with unequal covariance matrices, QDA was best while LDA and logistic regression were worst.

(b) Under a skewed distribution with equal covariance matrices, decision trees and BPNN outperformed the other methods. When the variances were unequal, decision trees, KNN, and QDA performed well, whereas LDA and logistic regression performed poorly.
(c) Under a mixed distribution with proportional prior probability and equal covariance matrices, decision trees and logistic regression had smaller overall misclassification rates and larger AUCs than the others, while LDA, QDA, and KNN were worst. When the variances were unequal, decision trees performed best, and LDA and logistic regression came last.

(d) When the predictors were dichotomous categorical, decision trees and logistic regression had lower misclassification rates and larger AUCs; LDA, QDA, KNN, and BPNN showed the opposite pattern.

2 Heterogeneity of covariance matrices

Heterogeneity of the covariance matrices played a relatively important role in the misclassification rates of the eight methods. The misclassification rate of the group with the larger covariance was higher than under equal covariance matrices, while the rate of the group with the smaller covariance decreased. The loss in the larger-covariance group exceeded the gain in the smaller-covariance group, so the overall misclassification rate worsened; the more unbalanced the heterogeneity, the more obvious this trend. Taking the multivariate normal distribution as an example, the ratio of the two groups' misclassification rates was 1.14~2.30 at a covariance ratio of 1:4 and 1.10~3.80 at 1:8.

3 Sample size and sample ratio

When the two groups had the same distribution, effect size, and covariance matrix, the misclassification rates decreased, and the AUC increased, as the total sample grew. Sample size played a relatively minor role: when the sample increased from 60 to 400, the overall misclassification rates decreased by 2%~11% under a normal distribution and equal covariance matrices. BPNN was the method most sensitive to sample size; KNN was the least. Sample ratio, by contrast, had a great impact on the misclassification rates.
The misclassification rate of the larger group was lower than under an equal sample ratio, and that of the smaller group was higher. Because the gain in the larger group and the loss in the smaller group did not offset each other, the overall misclassification rate changed as well, and the more unbalanced the sample ratio, the more obvious this trend. For example, when the sample ratio changed from 50:50 to 10:90, the misclassification rate decreased by 10%~98% in the larger group and increased by 17%~83% in the smaller group. This change may seriously impair sensitivity.

4 Effect size

Effect size played a relatively important role in the misclassification rate. The misclassification rates decreased, and the area under the ROC curve increased, as the effect size between the two groups grew. All methods changed to different degrees; for example, the misclassification rates decreased by 30%~55% when the effect size increased from 0.2 to 0.8 under a normal distribution, equal sample ratio, and equal covariance matrices.

5 Multicollinearity

The misclassification rates decreased slightly as multicollinearity among the predictor variables increased. For example, they decreased by 1%~9% when the correlation coefficient rose from 0.3 to 0.9 under a normal distribution, equal sample ratio, and a covariance ratio of 1:4. Multicollinearity thus played a relatively minor role in this Monte Carlo study. We believe this result is tied to the simulation conditions, in which X2 and X3 had a stronger relationship.

6 Prior probability

Prior probability had an obvious impact on the misclassification rates. With the other conditions fixed, the misclassification rate of the group with the larger prior probability was lower than under equal prior probabilities, and that of the smaller group was higher.
Because the gain in one group did not offset the loss in the other, the overall misclassification rate changed as well. For example, under a normal distribution, equal sample ratio, and homogeneous covariance matrices, when the prior probability changed from 1:1 to 1:3, the misclassification rate of the larger-prior group decreased by a factor of 1.48~8.57, while that of the smaller-prior group increased by a factor of 1.35~2.94. This change may seriously affect sensitivity, and the more unbalanced the prior probability ratio, the more obvious the trend. When the prior probability was set proportional to group size, the misclassification rate of the group with the larger prior probability and sample ratio was lower than under equal prior probabilities, other conditions being fixed. For example, the misclassification rate of the larger group decreased by a factor of 2.15~8.90 when the prior probability went from 1:1 to a proportional 1:3 under a normal distribution and homogeneous covariance matrices; this change exceeded the former one.

7 Case analysis

Based on the simulation results, we applied three classification methods to real medical data: quadratic discriminant analysis to myocardial infarction data, a BP neural network to fatty liver data, and CART to diabetes data. All three models fitted their data well and can be applied in medical practice.

Conclusion: Each method, alone or in combination, showed advantages under specific data conditions, so a strategy for choosing a classification method can be built on the results of this study. First, the data should be pretreated: data cleaning, data integration, data transformation, and data reduction.
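The pretreatment steps named here can be sketched as follows; pandas is an assumed toolchain (the dissertation does not prescribe specific tools), and the column names are hypothetical.

```python
# Minimal sketch of pretreatment: cleaning, transformation, reduction.
# The DataFrame columns (age, glucose, glucose_dup) are illustrative only.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, 51, np.nan, 42],
                   "glucose": [5.1, 7.8, 6.0, 6.0],
                   "glucose_dup": [5.1, 7.8, 6.0, 6.0]})

# Cleaning: impute the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Transformation: standardize a continuous predictor.
df["glucose_z"] = (df["glucose"] - df["glucose"].mean()) / df["glucose"].std()

# Reduction: drop a perfectly redundant (duplicated) predictor.
df = df.drop(columns=["glucose_dup"])
print(df)
```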
Second, the distribution of the predictor variables and the heterogeneity of covariance matrices between the groups should be considered. Under a multivariate normal distribution with proportional prior probability and equal covariance matrices, LDA and logistic regression should be selected among the eight methods; under a multivariate normal distribution with unequal covariance matrices, QDA should be selected. Under a skewed distribution with equal covariance matrices, decision trees and BPNN were better than the other methods; when the variances were unequal, decision trees, KNN, and QDA performed well, while LDA and logistic regression performed poorly. Under a mixed distribution with proportional prior probability and equal covariance matrices, decision trees and logistic regression were better than the other methods, and LDA, QDA, and KNN were worst; when the variances were unequal, decision trees performed best, and LDA and logistic regression came last. When the distribution of the predictor variables was dichotomous categorical, decision trees and logistic regression should be selected.
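The selection strategy above can be encoded as a small lookup. The rules restate the study's findings; the helper function itself is an illustrative sketch, not part of the dissertation.

```python
# Illustrative encoding of the method-selection strategy from the conclusion.
def suggest_methods(distribution, equal_covariance):
    """distribution: 'normal', 'skewed', 'mixed', or 'dichotomous'."""
    if distribution == "normal":
        return ["LDA", "logistic"] if equal_covariance else ["QDA"]
    if distribution == "skewed":
        return (["decision tree", "BPNN"] if equal_covariance
                else ["decision tree", "KNN", "QDA"])
    if distribution == "mixed":
        return (["decision tree", "logistic"] if equal_covariance
                else ["decision tree"])
    if distribution == "dichotomous":
        return ["decision tree", "logistic"]
    raise ValueError(f"unknown distribution: {distribution}")

print(suggest_methods("normal", equal_covariance=True))  # ['LDA', 'logistic']
```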
Keywords/Search Tags: Classification, Monte Carlo simulation, Discriminant analysis, Decision tree, Logistic regression, Neural network, Multicollinearity, Prior probability