
Large dimension and small sample size problems: Classification, gene selection and asymptotics

Posted on: 2007-09-28 | Degree: Ph.D | Type: Dissertation
University: Michigan State University | Candidate: Luo, Jun | Full Text: PDF
GTID: 1448390005968832 | Subject: Statistics
Abstract/Summary:
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) and penalized logistic regression (PLR) have been applied successfully to microarray cancer diagnosis problems. Both methods place an equal penalty on the loss of each sample, which may lead to misclassification on unbalanced data. We therefore propose nu-ridge regression (nu-RR), which puts a generalized weight on the loss of each sample and optimizes the weight vector within the model itself, as an alternative to the SVM and PLR for classification in microarray cancer diagnosis. A primary goal in microarray analysis is often to identify the genes most responsible for classification. Two gene selection methods are considered: univariate ranking (UR) and recursive feature elimination (RFE).

Simulations on the well-known leukemia data and breast cancer prognosis data indicate that nu-RR combined with either UR or RFE tends to select a smaller set of significant genes than other methods. Meanwhile, nu-RR outperforms the SVM and PLR, achieving lower cross-validation and test error rates.

One weakness of the SVM is that, given a tumor sample, it predicts only a cancer class label and provides no estimate of the underlying probability. Penalized logistic regression has the advantage of additionally providing an estimate of the probability of being assigned to each class, but it does not offer an estimate of the probability of the outcome class conditional on an individual gene variable. We propose the conditional logistic regression (CLR) model, an alternative classifier for microarray cancer diagnosis, which models the underlying probability of the response given any gene variable. In addition, since gene selection is a primary goal in microarray cancer diagnosis, we propose a new method, modified univariate ranking (MUR), as a further choice for dimension reduction.

We show that, when applied to microarray data for classification, CLR performs similarly to the SVM, PLR and Bayesian model averaging (BMA), but has the advantage of providing the probability of the outcome class conditional on any individual gene variable. Empirical results on the leukemia and breast cancer data indicate that CLR combined with a gene selection method (MUR, BSS/WSS or RFE) tends to perform better than the competing methods in both cross-validation error and test error rate.

Microarray data typically have a very high dimension p and a much smaller sample size n. Classical asymptotic theory treats p as fixed while n goes to infinity, which is no longer appropriate for microarray data analysis. The literature contains discussions of the behavior of estimators when both p and n tend to infinity, but very little on the case where n is fixed and p tends to infinity, which is the situation most relevant to microarray data in practice. Here we describe the asymptotic behavior of ridge regression estimators when the sample size n is fixed and the dimension p tends to infinity; mean squared error consistency is established under certain regularity conditions. When only a finite number of important genes are actually related to the outcome, we propose a variable screening method that eliminates genes unrelated to the outcome, and we prove the asymptotic consistency of the procedure. After screening, the dimension-reduced microarray data can be further analyzed with a well-known variable selection criterion such as AIC or BIC.
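As a rough illustration of the screening-then-refit workflow described above, the following Python sketch ranks genes by a simple univariate correlation statistic and then fits ridge regression on the retained genes. The statistic, the cutoff n_keep, and the helper names (screen_genes, ridge_fit) are illustrative assumptions, not the actual procedure or regularity conditions established in the dissertation.

    import numpy as np

    def screen_genes(X, y, n_keep):
        """Rank genes by the absolute correlation between each gene and the
        outcome and keep the top n_keep genes (a generic screening statistic
        used here only for illustration)."""
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        num = Xc.T @ yc                                    # p-vector of cross-products
        den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
        scores = np.abs(num / den)                         # |sample correlation| per gene
        return np.sort(np.argsort(scores)[::-1][:n_keep])  # indices of retained genes

    def ridge_fit(X_kept, y, lam=1.0):
        """Closed-form ridge estimate on the screened (n x d, d small) design."""
        d = X_kept.shape[1]
        return np.linalg.solve(X_kept.T @ X_kept + lam * np.eye(d), X_kept.T @ y)

The retained columns, e.g. X[:, screen_genes(X, y, 50)], could then be handed to an AIC- or BIC-based selection step, in the spirit of the final stage described above.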
Simulation results evaluating the performance of the screening method are also presented.
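To give a flavor of how such a screening step might be tested by simulation, here is a toy Python sketch that generates data with a handful of informative genes and checks how many survive a correlation-based screen. The sample size, dimension, number of informative genes, and cutoff are hypothetical choices for illustration and do not reproduce the simulations reported in the dissertation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy setting (n fixed and small, p large, a few informative
    # genes); the numbers are illustrative, not those used in the dissertation.
    n, p, n_informative = 50, 2000, 5
    beta = np.zeros(p)
    beta[:n_informative] = 2.0                  # informative genes placed first
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)

    # Correlation-based screening, as in the sketch above.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    kept = np.argsort(scores)[::-1][:20]        # retain the top 20 genes

    # How many truly informative genes survive the screen?
    recovered = np.intersect1d(kept, np.arange(n_informative)).size
    print(f"informative genes retained: {recovered}/{n_informative}")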
Keywords/Search Tags: Sample, Classification, Gene selection, Cancer diagnosis, SVM, Method, Dimension, CLR