Kernel partial least squares (K-PLS) for scientific data mining

Posted on:2008-01-11

Degree:Ph.D

Type:Thesis

University:Rensselaer Polytechnic Institute

Candidate:Han, Long

GTID:2448390002999907

Subject:Engineering

Abstract/Summary:

This dissertation investigates the use of kernel partial least squares (K-PLS) for scientific data mining. K-PLS is a machine learning technique that applies the kernel trick to partial least squares, a statistical technique commonly used for collinear data problems in chemometrics and drug design. It can be shown that K-PLS is closely related to modern machine learning techniques such as support vector machines, and that it can also be interpreted as a neural network. Learning is a broad concept that can commonly be divided into four complex systems tasks: (1) problem representation, (2) data preprocessing, (3) predictive modeling, and (4) variable and feature selection. Each of these components contributes to model transparency and prediction performance.

For the preprocessing part, a basic data transformation technique, Principal Component Analysis (PCA), has been extended to Independent Components Analysis (ICA). The ICA Transform (ICAT) and ICA-based data cleansing have been introduced. In addition, a novel kernel centering algorithm has been introduced.

In the machine learning part, the SUpport vector Parsimonious ANOVA (SUPANOVA) transparent (reversible) spline kernel has been implemented to improve the causality analysis of the model. The proposed new spline kernel has also been integrated into the K-PLS framework. The K-PLS algorithm has also been extended so that it can be implemented with any loss function for multiple responses. Additionally, Rényi's quadratic entropy loss function has been used to deal with unbalanced classification problems.

Two new variable selection algorithms have been introduced in this thesis: (1) feature selection based on sigma-tuning of the Gaussian kernel, and (2) Random Forests feature selection. These variable selection methods have been demonstrated on benchmark data sets and compared with other feature selection methods based on sensitivity analysis and Z-scores.

Finally, these methodologies have been applied to three different scientific data mining problems: (1) predicting ischemia from magnetocardiogram data; (2) Quantitative Structure-Activity Relationship (QSAR) drug design for the discovery of novel pharmaceuticals; and (3) identification of trace materials from terahertz spectra.
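
As a rough illustration of what applying the kernel trick to partial least squares involves, the sketch below centers a Gaussian kernel matrix and extracts K-PLS components with a NIPALS-style iteration. It assumes the textbook centering formula and the commonly published form of K-PLS; it is not the novel centering algorithm or the extended K-PLS described in the abstract, and all function names, the toy data, and the parameter values are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise squared distances, clipped at zero against round-off.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def center_train_kernel(K):
    # Textbook centering in feature space: H K H with H = I - (1/n) 1 1^T.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def center_test_kernel(K_test, K_train):
    # Centers a (test x train) kernel using the training-set means.
    n = K_train.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    ones_tn = np.ones((K_test.shape[0], n)) / n
    return (K_test - ones_tn @ K_train) @ H

def kpls_fit(K, Y, n_components, n_iter=100, tol=1e-10):
    # K: centered training kernel (n x n); Y: centered responses (n x m).
    # Returns dual coefficients B so that predictions are K_new @ B.
    n = K.shape[0]
    Kd, Yd = K.copy(), Y.copy()
    T = np.zeros((n, n_components))
    U = np.zeros((n, n_components))
    for a in range(n_components):
        u = Yd[:, [0]].copy()
        for _ in range(n_iter):
            t = Kd @ u
            t /= np.linalg.norm(t)          # unit-norm score vector
            c = Yd.T @ t                    # response weights
            u_new = Yd @ c
            u_new /= np.linalg.norm(u_new)
            converged = np.linalg.norm(u_new - u) < tol
            u = u_new
            if converged:
                break
        T[:, [a]] = t
        U[:, [a]] = u
        # Deflate the kernel and the responses by the extracted score.
        P = np.eye(n) - t @ t.T
        Kd = P @ Kd @ P
        Yd = Yd - t @ (t.T @ Yd)
    return U @ np.linalg.solve(T.T @ K @ U, T.T @ Y)

# Toy usage on synthetic data (illustrative values only).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40)[:, None]
y = np.sin(X) + 0.05 * rng.standard_normal(X.shape)
y_mean = y.mean(axis=0)
K = center_train_kernel(gaussian_kernel(X, X, sigma=1.0))
B = kpls_fit(K, y - y_mean, n_components=3)
y_fit = K @ B + y_mean
```

In this sketch the Gaussian kernel width `sigma` and the number of components are the two main tuning parameters. New samples would be predicted by building their kernel against the training set, centering it with `center_test_kernel`, multiplying by `B`, and adding back the response mean.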

Keywords/Search Tags:

K-PLS, Partial least squares, Scientific data mining, Kernel, ICA, Feature selection

Related items

1	Spam Filtering Based On Kernel Partial Least Squares Feature Extraction
2	PLS Algorithm And Its Applications To SRM-Based Machine Learning
3	The Research On Text Categorization Technology Based On Partial Least Square
4	Industrial Process Monitoring Based On Kernel Partial Least Squares
5	Data-driven Key Performance Indicator Related Fault Detection Approaches
6	Based On Fuzzy Partial Least Squares Feature Extraction Methods
7	Research Of Partial Least Squares Regression Algorithm Based On Optimal Selection Of Latent Variables
8	Image Feature Extraction Methods
9	Data Mining And Feature Selection Of High Dimensional Biomedical Data Based On TCGA And Pubmed Databases
10	Hand Gesture Recognition Based On Feature Fusion And Partial Least Squares Dimensionality Reduction