
Kernel partial least squares (K-PLS) for scientific data mining

Posted on: 2008-01-11
Degree: Ph.D.
Type: Thesis
University: Rensselaer Polytechnic Institute
Candidate: Han, Long
Full Text: PDF
GTID: 2448390002999907
Subject: Engineering
Abstract/Summary:
This dissertation investigates the use of kernel partial least squares (K-PLS) for scientific data mining. K-PLS is a machine learning technique that applies the kernel trick to partial least squares, a statistical technique commonly used for collinear data problems in chemometrics and drug design. It can be shown that K-PLS is closely related to modern machine learning techniques such as support vector machines and can also be interpreted as a neural network. Learning is a broad concept and can commonly be divided into four complex-systems tasks: (1) problem representation, (2) data preprocessing, (3) predictive modeling, and (4) variable and feature selection. Each of these components contributes to model transparency and prediction performance.

For the preprocessing part, a basic data transformation technique, Principal Component Analysis (PCA), has been extended to Independent Component Analysis (ICA). The ICA Transform (ICAT) and ICA-based data cleansing have been introduced. In addition, a novel kernel centering algorithm has been introduced.

In the machine learning part, the SUpport vector Parsimonious ANOVA (SUPANOVA) transparent (reversible) spline kernel has been implemented to improve the causality analysis of the model. The proposed new spline kernel has also been integrated into the K-PLS framework. The K-PLS algorithm has also been extended so that it can be implemented with any loss function for multiple responses. Additionally, Renyi's quadratic entropy loss function has been used to deal with unbalanced classification problems.

Two new variable selection algorithms have been introduced in this thesis: (1) feature selection based on sigma-tuning of the Gaussian kernel, and (2) Random Forests feature selection. These variable selection methods have been demonstrated on benchmark data sets and compared with other feature selection methods based on sensitivity analysis and Z-scores.

Finally, these methodologies have been applied to three different scientific data mining problems: (1) predicting ischemia from magnetocardiogram data; (2) Quantitative Structure-Activity Relationship (QSAR) drug design for the discovery of novel pharmaceuticals; and (3) identification of trace materials from terahertz spectra.
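To make the phrase "applies the kernel trick to partial least squares" more concrete, the following is a minimal sketch in plain Python, not the thesis's own implementation. It builds a Gaussian kernel matrix, applies the standard textbook double-centering (which is not the novel centering algorithm mentioned in the abstract), and computes the first K-PLS latent score for a single mean-centered response using the usual NIPALS-style step t = Ky / ||Ky||. The function names, the toy data, and the choice sigma = 1.0 are all assumptions made for illustration.

```python
import math


def gaussian_kernel(X, sigma):
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return K


def center_kernel(K):
    """Standard double-centering: K_c = K - row mean - column mean + grand mean."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def first_kpls_score(K_c, y):
    """First latent score for one centered response: t = K_c y / ||K_c y||."""
    t = [sum(k_ij * y_j for k_ij, y_j in zip(row, y)) for row in K_c]
    norm = math.sqrt(sum(v * v for v in t))
    return [v / norm for v in t]


# Toy data, made up for illustration: 4 samples with 2 features each.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
y_raw = [0.0, 1.0, 1.0, 4.0]
y_mean = sum(y_raw) / len(y_raw)
y = [v - y_mean for v in y_raw]  # the response must be mean-centered

K_c = center_kernel(gaussian_kernel(X, sigma=1.0))
t = first_kpls_score(K_c, y)
```

A full K-PLS model would go on to deflate the kernel matrix and the response with each score vector and repeat the step for further latent components. The kernel width sigma used here is the same parameter that the sigma-tuning feature selection method described above adjusts.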
Keywords/Search Tags: K-PLS, Partial least squares, Scientific data mining, Kernel, ICA, Feature selection