
Scalable customized machine learning models motivated by pharmaceutical chemistry applications

Posted on: 2011-10-28
Degree: Ph.D
Type: Dissertation
University: Rensselaer Polytechnic Institute
Candidate: Bergeron, Charles
Full Text: PDF
GTID: 1448390002950517
Subject: Applied Mathematics
Abstract/Summary:
This dissertation addresses two aspects of drug discovery: the identification of screening hits (finding compounds that effectively modulate a disease-specific biological pathway) and the analysis of the metabolic liability properties of compounds (their ability to withstand degradation in the liver). The models and algorithms developed for these applications emphasize two properties: customization (models formulated to answer specific questions) and linear scalability (the computational effort of model calculation is proportional to sample size).

Drug discovery aims to develop new medicines to treat a wide range of medical conditions. Getting a single drug approved is a lengthy and expensive procedure. Mathematical modeling that uses recent techniques for learning from data is being exploited to reduce the time and cost of this process. This dissertation develops linearly scalable, customized techniques that extract better-quality information from experimental data, and it calculates robust models that explain chemical phenomena in pharmaceutical datasets and make predictions on new compounds.

This work finds that improved performance results from adopting modeling approaches customized to specific applications. It also develops linearly scalable algorithms, a necessary feature when working with the increasingly large datasets generated by computational chemists and arising in other applications.

Motivated by the need for accurate metabolic liability modeling, this dissertation develops a paradigm called multiple-instance ranking (MIRank). The underlying datasets present three levels of structure: compounds, potential sites of metabolic liability, and atoms. Accuracy is measured at the compound level, while the response is known at the site level and features are computed at the atomic level. By exploiting this special structure, MIRank outperforms other state-of-the-art formulations on metabolic liability modeling and on other real and synthetic datasets, and further application areas for MIRank are identified.

Multiple-instance learning (MIL) problems, including MIRank, are nonconvex. This dissertation implements and adapts a nonsmooth nonconvex subgradient method that is little known in the machine learning community. Empirical results show linear scalability on the class of MIL problems. This fast algorithm permits model parameter estimation on a large number of datasets with more samples and features, and allows kernel functions to be used for nonlinear modeling.

This dissertation identifies limitations of least-squares curve fitting of the nonlinear Hill equation to quantitative high-throughput screening (qHTS) data points. A customized technique called DK-fitter is proposed that injects domain knowledge (or prior knowledge) into the curve-fitting process. A new validation technique objectively demonstrates the superiority of DK-fitter over the standard least-squares approach.

Finally, this dissertation performs screening hit identification computationally. It applies quantile regression, widely used in econometrics and ecology, as a well-suited modeling framework for this problem, and finds that the generated lists of predicted top-active compounds contain up to 17 times more true positives than those of a random model.

The models presented in this dissertation possess improved predictive and generalization ability over previously published ones (to the extent that comparisons are possible) and/or are validated on standard real and synthetic data from the literature. Some models perform competitively with the literature but are calculated using faster algorithms. Tight experimental designs were implemented to produce robust models that generalize well. This dissertation advances the state of the art in nonconvex subgradient methods, nonlinear curve fitting, and drug discovery modeling.
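As a rough illustration of the quantile regression framework mentioned in the abstract, the sketch below shows the standard pinball (quantile) loss that quantile regression minimizes. It is a generic textbook construction, not the dissertation's formulation: the abstract does not state which quantile level, features, or solver were used, and the function names `pinball_loss` and `best_constant` are hypothetical. A high quantile level penalizes under-prediction more than over-prediction, which is one plausible reason the framework suits ranking top-active compounds.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at quantile level tau in (0, 1).

    Under-predictions (y > prediction) are weighted by tau,
    over-predictions by (1 - tau).
    """
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def best_constant(y_true, tau):
    """Observed value that minimizes the pinball loss as a constant prediction.

    Illustrative only: this brute-force search shows that the minimizer
    is an empirical tau-quantile of the data. A real quantile regression
    would fit a model of the features instead of a constant.
    """
    candidates = sorted(set(y_true))
    n = len(y_true)
    return min(candidates, key=lambda c: pinball_loss(y_true, [c] * n, tau))


if __name__ == "__main__":
    activities = list(range(1, 11))  # made-up activity values, for illustration
    print(best_constant(activities, 0.75))
```

With the made-up values above, the constant that minimizes the loss at level 0.75 is the eighth-smallest observation, which shows that minimizing the pinball loss recovers an empirical quantile rather than the mean.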
Keywords/Search Tags: Models, Dissertation, Drug discovery, Customized, Compounds, Scalable, Metabolic liability