Novel Cheminformatics Methods for Modeling Biomolecular Data in High Dimension Low Sample Size (HDLSS) Chemistry Space

Posted on: 2013-08-20
Degree: Ph.D.
Type: Dissertation
University: The University of North Carolina at Chapel Hill
Candidate: Wu, Tong-Ying
GTID: 1458390008473752
Subject: Engineering
Abstract/Summary:
The increasing availability of biological and chemical data has led to a critical need for cheminformatics and bioinformatics tools to analyze the data. However, not all datasets contain sufficient information for significant analysis. One problem is High Dimension Low Sample Size (HDLSS), where the number of structural characteristics, e.g., molecular descriptors, that can be calculated from a single compound (high dimensionality) far exceeds the number of compounds (low sample size). A major challenge associated with modeling HDLSS data is overfitting, and specialized tools are required to overcome the statistical difficulties inherent to HDLSS. We improved the Distance Weighted Discrimination (DWD) statistical learning method through a new variable selection technique for estimating the intrinsic dimension of a dataset, i.e., the minimum number of descriptors needed to classify the data. Compared to SVM and DWD without variable selection, DWD with variable selection achieved higher prediction accuracy on several benchmarking datasets and allowed the identification of key molecular features that contribute to the investigated biological properties, e.g., inhibition of AmpC beta-lactamase and binding affinity for the various serotonin receptors.

For analyzing and modeling stereochemistry-dependent datasets, we developed chiral atom-pair descriptors (3D chiral atom-pair), which were calculated from three-dimensional molecular structures. QSAR models built with these descriptors, versus either 3D non-chiral atom-pair or 2D Dragon descriptors, more accurately predicted antimalarial activity and binding affinities of small molecules toward various receptors. Our method not only led to the identification of a subset of chiral atoms that are expected to affect the selected biological property, e.g., antimalarial activity, but also enabled the development of models that would not be possible otherwise.

To aid automatic protein function annotation, especially in the case of functional homologs, we developed new protein descriptors based solely on protein structure. Our method showed sensitivity comparable to that of ScanPROSITE. When predicted annotations from both ScanPROSITE and our method were combined into a consensus model, we observed a significant gain in prediction reliability and the successful functional annotation of proteins with low sequence similarity.
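
The overfitting risk that the abstract attributes to HDLSS data can be illustrated with a small synthetic sketch. This is not the dissertation's DWD method or its variable selection technique; it assumes random Gaussian "descriptors", random class labels, and a plain minimum-norm least-squares linear classifier, and the function name and parameters are hypothetical. With far more descriptors than compounds, the classifier fits pure noise perfectly on the training set while predicting unseen data at roughly chance level.

```python
import numpy as np


def hdlss_overfit_demo(n_samples=20, n_features=500, n_test=200, seed=0):
    """Fit a linear classifier to random labels in an HDLSS setting.

    Hypothetical illustration only: descriptors and labels are pure noise,
    so any apparent training accuracy reflects overfitting, not signal.
    """
    rng = np.random.default_rng(seed)

    # Training set: few "compounds" (rows), many "descriptors" (columns).
    X_train = rng.standard_normal((n_samples, n_features))
    y_train = rng.choice([-1.0, 1.0], size=n_samples)

    # Minimum-norm least-squares solution of X w = y. Because
    # n_features > n_samples, the system is underdetermined and the
    # training labels can be reproduced exactly.
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    train_acc = np.mean(np.sign(X_train @ w) == y_train)

    # Independent test set drawn from the same noise distribution.
    X_test = rng.standard_normal((n_test, n_features))
    y_test = rng.choice([-1.0, 1.0], size=n_test)
    test_acc = np.mean(np.sign(X_test @ w) == y_test)

    return float(train_acc), float(test_acc)


if __name__ == "__main__":
    train_acc, test_acc = hdlss_overfit_demo()
    print(f"training accuracy: {train_acc:.2f}")
    print(f"test accuracy:     {test_acc:.2f}")
```

Reducing the descriptor set before fitting, as variable selection aims to do, shrinks the gap between the number of dimensions and the number of samples that makes this kind of perfect-but-meaningless fit possible.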
Keywords/Search Tags:Low sample size, Data, HDLSS, Method, Modeling, Molecular, Dimension