Model Selection For Analyzing High-dimensional, Strongly Correlated Data

Posted on: 2012-10-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G H Fu
GTID: 1480303353989509
Subject: Probability theory and mathematical statistics
Abstract/Summary:
This dissertation investigates how to handle high-dimensional, strongly correlated, multi-collinear, noisy data from the perspective of model selection. The thesis has three main parts.

In chapter two, a novel wavelength region selection algorithm, called elastic net grouping variable selection combined with partial least squares regression (EN-PLSR), is proposed for multi-component spectral data analysis. The EN-PLSR algorithm automatically selects groups of successive, strongly correlated predictor variables related to the response variable in two steps. First, a subset of correlated predictors is selected and divided into subgroups by means of the grouping effect of the elastic net estimator. Then, a recursive leave-one-group-out strategy is employed to further shrink the variable groups according to the root mean square error of cross validation (RMSECV) criterion. Results on real near-infrared (NIR) spectroscopic data sets show that the EN-PLSR algorithm is competitive with full-spectrum PLS and moving window partial least squares regression, and that it is suitable for strongly correlated spectroscopic data.

In chapter three, we again consider the problem of variable selection and estimation for strongly correlated, multi-collinear data using grouping variable selection techniques. A new grouping variable selection method, called weight fused elastic net (WFEN), is proposed to deal with high-dimensional collinear data. The proposed model combines two different grouping effect mechanisms, induced by the elastic net and weight fused LASSO penalties respectively, and it can be unified within the LASSO framework and computed efficiently.
We evaluate our algorithm on simulated and real data sets; the results show that our method is competitive with other related methods, especially when the data exhibit high multi-collinearity.

In chapter four, a two-step nonlinear classification algorithm, consisting of kernel principal component analysis (KPCA) followed by a linear support vector machine (KPCA+LSVM), is proposed to model the structure-activity relationship (SAR) between the bioactivities and the molecular descriptors of compounds. KPCA is used to remove uninformative content such as noise and then to capture the latent structure of the training dataset using new variables, called principal components, in the kernel-defined feature space. LSVM then uses the maximal margin hyperplane to obtain good generalization performance in the KPCA-transformed space. The combination of KPCA and LSVM can effectively improve prediction performance compared with the linear SVM as well as two nonlinear methods. Three datasets related to different categorical bioactivities of compounds are used to evaluate the performance of KPCA+LSVM. The internal and external validation results show that our algorithm is competitive.
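
The two-step idea from chapter four (project with KPCA, then fit a linear SVM in the projected space) can be illustrated with a minimal sketch. This is not the dissertation's implementation: it assumes scikit-learn, a synthetic two-circles dataset in place of the SAR data, and an RBF kernel with illustrative values for `gamma`, `n_components`, and `C`. The function name `kpca_lsvm_demo` is hypothetical.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def kpca_lsvm_demo(seed=0):
    """Compare a plain linear SVM with KPCA followed by a linear SVM.

    Assumptions (not from the dissertation): synthetic data, RBF kernel,
    gamma=10, two kernel principal components, C=1.
    """
    # Two concentric circles: not linearly separable in the input space.
    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05,
                        random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)

    # Baseline: linear SVM on the raw inputs.
    linear_only = LinearSVC(C=1.0, max_iter=10000, random_state=seed)
    linear_only.fit(X_train, y_train)

    # Step 1: KPCA maps the data to principal components in the
    # kernel-defined feature space.
    # Step 2: a linear SVM finds a maximal margin hyperplane there.
    kpca_lsvm = make_pipeline(
        KernelPCA(n_components=2, kernel="rbf", gamma=10.0),
        LinearSVC(C=1.0, max_iter=10000, random_state=seed),
    )
    kpca_lsvm.fit(X_train, y_train)

    # Held-out accuracy of each model.
    return (linear_only.score(X_test, y_test),
            kpca_lsvm.score(X_test, y_test))


if __name__ == "__main__":
    acc_linear, acc_kpca = kpca_lsvm_demo()
    print(f"linear SVM accuracy:        {acc_linear:.2f}")
    print(f"KPCA + linear SVM accuracy: {acc_kpca:.2f}")
```

Wrapping both steps in one pipeline means the kernel projection is fitted on the training split only and then reused unchanged on held-out data, which mirrors the internal/external validation setting described above.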
Keywords/Search Tags:Model selection, Grouping variable selection, LASSO, Elastic net, Weight fused LASSO, Strongly correlated, Kernel methods, Kernel principal component analysis (KPCA)