Novel Cheminformatics Methods for Modeling Biomolecular Data in High Dimension Low Sample Size (HDLSS) Chemistry Space

Posted on:2013-08-20

Degree:Ph.D

Type:Dissertation

University:The University of North Carolina at Chapel Hill

Candidate:Wu, Tong-Ying

GTID:1458390008473752

Subject:Engineering

Abstract/Summary:

The increasing availability of biological and chemical data has led to a critical need for cheminformatics and bioinformatics tools to analyze the data. However, not all datasets contain sufficient information for meaningful analysis. One such problem is the High Dimension Low Sample Size (HDLSS) setting, where the number of structural characteristics, e.g., molecular descriptors, that can be calculated from a single compound (high dimensionality) far exceeds the number of compounds (low sample size). A major challenge in modeling HDLSS data is overfitting, and specialized tools are required to overcome the statistical difficulties inherent to HDLSS. We improved the Distance Weighted Discrimination (DWD) statistical learning method through a new variable selection technique for estimating the intrinsic dimension of a dataset, i.e., the minimum number of descriptors needed to classify the data. Compared with support vector machines (SVM) and DWD without variable selection, DWD with variable selection achieved higher prediction accuracy on several benchmark datasets and allowed the identification of key molecular features that contribute to the biological properties investigated, e.g., inhibition of AmpC beta-lactamase and binding affinity for various serotonin receptors.

For analyzing and modeling stereochemistry-dependent datasets, we developed chiral atom-pair descriptors (3D chiral atom-pair), which were calculated from three-dimensional molecular structures. QSAR models built with these descriptors, versus either 3D non-chiral atom-pair or 2D Dragon descriptors, more accurately predicted antimalarial activity and binding affinities of small molecules toward various receptors. Our method not only led to the identification of a subset of chiral atoms that are expected to affect the selected biological property, e.g., antimalarial activity, but also enabled the development of models that would not be possible otherwise.

To aid automatic protein function annotation, especially in the case of functional homologs, we developed new protein descriptors based solely on protein structure. Our method showed sensitivity comparable to that of ScanPROSITE. When predicted annotations from both ScanPROSITE and our method were combined into a consensus model, we observed a significant gain in prediction reliability and the successful functional annotation of proteins with low sequence similarity.
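
The overfitting risk that the abstract attributes to HDLSS data can be illustrated with a small numerical sketch. The example below is not the dissertation's DWD method or its variable selection technique; it is a generic demonstration, assuming synthetic Gaussian "descriptors", purely random class labels, and a minimum-norm least-squares linear classifier. The function name hdlss_overfit_demo and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def hdlss_overfit_demo(n_train=20, n_test=200, n_features=500, seed=0):
    """Fit a linear classifier to random labels when features >> samples.

    Assumption: descriptors are i.i.d. standard normal and labels are
    random coin flips, so there is no real signal to learn.
    Returns (train_accuracy, test_accuracy).
    """
    rng = np.random.default_rng(seed)

    # Synthetic descriptor matrices: rows are "compounds", columns are "descriptors".
    X_train = rng.standard_normal((n_train, n_features))
    X_test = rng.standard_normal((n_test, n_features))

    # Random +/-1 labels, independent of the descriptors.
    y_train = rng.choice([-1.0, 1.0], size=n_train)
    y_test = rng.choice([-1.0, 1.0], size=n_test)

    # Minimum-norm least-squares weights. With n_features > n_train the
    # system is underdetermined, so the training labels can be fit exactly.
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    train_acc = float(np.mean(np.sign(X_train @ w) == y_train))
    test_acc = float(np.mean(np.sign(X_test @ w) == y_test))
    return train_acc, test_acc


if __name__ == "__main__":
    train_acc, test_acc = hdlss_overfit_demo()
    print(f"train accuracy: {train_acc:.2f}")
    print(f"test accuracy:  {test_acc:.2f}")
```

With far more descriptors than compounds, the classifier separates the training set perfectly even though the labels carry no information, while accuracy on held-out data stays near chance. This is the kind of behavior that motivates reducing the number of descriptors before modeling HDLSS data.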

Keywords/Search Tags:

Low sample size, Data, HDLSS, Method, Modeling, Molecular, Dimension

Related items

1	Partition clustering of high dimensional low sample size data based on p-values
2	Classification methods for high-dimensional sparse data
3	Large dimension and small sample size problems: Classification, gene selection and asymptotics
4	Small Sample Based Linear Dimension Reduction Algorithm And Applications
5	Research On Virtual Sample Generation Method Based On Gibbs Sampling Algorithm
6	Sample Size in Ordinal Logistic Hierarchical Linear Modeling
7	The Study Of Complex Data Processing Method Based On Classification
8	Based On Fully Parametric Graphic Modeling And Size Of The Relationship-driven
9	Sample size estimation with nonparametric methods for one sample location tests under clustered data
10	Estimating effective sample size for spatially correlated data