
Pre-processing methods and stepwise variable selection for binary classification of high-dimensional data

Posted on: 2011-04-30
Degree: Ph.D
Type: Dissertation
University: The University of Texas at Dallas
Candidate: Ramachandar, Shahla
Full Text: PDF
GTID: 1448390002454599
Subject: Statistics
Abstract/Summary:
Classification of biological data has gained considerable attention in recent years. It is of particular importance for cancer-related data, where an accurate classification can make a substantial difference. This study is restricted to binary classification of genomic and metabolomic data. We are interested in modeling the response variable, which could be an indicator of a tumor or of a type of tumor, as a function of the genes or frequencies represented in the biological sample units. Apart from classification, identifying the underlying variables that drive the classification is a critical issue. Unlike many other studies, which focus on dimension reduction, we emphasize variable selection. The high dimensionality of these data, coupled with extremely small sample sizes, renders conventional modeling methods inefficient.

First, some pre-processing methods are studied and applied to eliminate noise variables. Depending on the nature of the data, a single method may not work in all cases. The dataset is split into a training set and a cross-validation set. The core algorithm is based on partial least squares regression. We have developed a new stepwise method in which the sum of squared errors plays a major role in efficiently identifying explanatory variables significant to the classification. Note that we do not seek the best model, as such a model may not exist; trade-offs between accuracy and efficiency are weighed carefully. Performance is assessed by analyzing the leave-one-sample-out misclassification error rates for both the training data and the cross-validation data. The developed methodologies are then applied to another dataset without a split. Biological data exhibit high correlation among predictor variables. For this reason, correlation analyses of the finally selected variables with all the other variables are required. These analyses indicate which other variables may have a similar impact. This may be of particular importance to biologists, who can use this information for appropriate interpretation.
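
The abstract does not spell out the stepwise procedure, so the following is only a rough sketch of the general idea, not the dissertation's algorithm. It assumes a one-component partial least squares fit, a forward search that adds the variable giving the smallest sum of squared errors, a stopping rule based on relative SSE gain, and a 0.5 cutoff for the binary label. All function names (`fit_pls1`, `forward_select`, `loo_error`, `correlated_partners`), the parameters `max_vars`, `min_gain` and `threshold`, and the synthetic data are hypothetical.

```python
import math
import random


def fit_pls1(X, y, cols):
    """One-component PLS regression of y on the columns listed in `cols`."""
    n = len(X)
    x_mean = [sum(row[j] for row in X) / n for j in cols]
    y_mean = sum(y) / n
    Xc = [[row[j] - m for j, m in zip(cols, x_mean)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of X'y, scaled to unit length.
    w = [sum(Xc[i][k] * yc[i] for i in range(n)) for k in range(len(cols))]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0
    w = [v / norm for v in w]
    t = [sum(a * b for a, b in zip(r, w)) for r in Xc]  # latent scores
    tt = sum(v * v for v in t)
    q = sum(a * b for a, b in zip(t, yc)) / tt if tt > 0 else 0.0
    sse = sum((b - q * a) ** 2 for a, b in zip(t, yc))
    return {"cols": list(cols), "x_mean": x_mean, "y_mean": y_mean,
            "w": w, "q": q, "sse": sse}


def predict(model, row):
    """Predicted response for one sample (a value near 0 or 1 for binary y)."""
    t = sum((row[j] - m) * w
            for j, m, w in zip(model["cols"], model["x_mean"], model["w"]))
    return model["y_mean"] + model["q"] * t


def forward_select(X, y, max_vars=5, min_gain=0.05):
    """Assumed rule: add the variable with the smallest SSE while it pays off."""
    selected, remaining = [], list(range(len(X[0])))
    y_mean = sum(y) / len(y)
    best_sse = sum((v - y_mean) ** 2 for v in y)  # SSE of the empty model
    while remaining and len(selected) < max_vars:
        sse, j = min((fit_pls1(X, y, selected + [j])["sse"], j)
                     for j in remaining)
        if best_sse - sse <= min_gain * best_sse:
            break  # relative SSE reduction too small to justify another variable
        selected.append(j)
        remaining.remove(j)
        best_sse = sse
    return selected


def loo_error(X, y, cols):
    """Leave-one-sample-out misclassification rate for a fixed variable set."""
    errors = 0
    for i in range(len(y)):
        model = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], cols)
        label = 1 if predict(model, X[i]) >= 0.5 else 0
        errors += int(label != y[i])
    return errors / len(y)


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb) if saa > 0 and sbb > 0 else 0.0


def correlated_partners(X, j, threshold=0.9):
    """Other variables whose absolute correlation with variable j is high."""
    col = [row[j] for row in X]
    return [k for k in range(len(X[0]))
            if k != j and abs(pearson(col, [row[k] for row in X])) >= threshold]


# Synthetic data (hypothetical): 20 samples, 30 variables. Variable 0 carries
# the class signal and variable 1 is a near copy of it; the rest are noise.
rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = []
for label in y:
    row = [rng.gauss(0, 1) for _ in range(30)]
    row[0] += 8 * label
    row[1] = row[0] + 0.1 * rng.gauss(0, 1)
    X.append(row)

selected = forward_select(X, y)
err = loo_error(X, y, selected)
partners = correlated_partners(X, selected[0])
print("selected:", selected, "LOO error:", err, "partners:", partners)
```

Because the variable set is chosen once on all samples and then held fixed inside the leave-one-out loop, the error rate from this sketch will tend to be optimistic; repeating the selection within each fold would be a stricter check. The final call illustrates the correlation step: a selected variable is reported together with the unselected variables that move with it, which may carry similar information.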
Keywords/Search Tags: Data, Classification, Methods, Variable