Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development

Posted on: 2019-05-04
Degree: Ph.D.
Type: Dissertation
University: The University of Akron
Candidate: Chen, Jonathan Jun Feng
Full Text: PDF
GTID: 1478390017494007
Subject: Biochemistry
Abstract/Summary:
Medicine is a precious commodity that saves, prolongs, or improves the quality of life. However, discovering active medicinal ingredients is challenging and is one of the major bottlenecks in developing new pharmaceuticals. The continual identification of new therapeutic targets and compounds exacerbates the problem, as the scale of the drug discovery endeavor grows to an unmanageable size. For example, the National Institutes of Health houses the National Library of Medicine (NLM), which contains an ever-growing archive of genes, proteins, and therapeutic targets as well as candidate compounds. Manual inspection of all compounds and biological targets cannot match the rate at which new information is created and deposited. New methods of data processing and drug candidate consideration are needed.

The work presented used and processed data from the NLM to identify new candidates for consideration. The drug discovery pipeline central to this work built models from existing compound-target interaction data that correlated structure to activity, and the models were used to identify the next candidates to test. Compound structural information was captured using the Signature molecular descriptor, while models were created using principal component analysis, a genetic algorithm, and support vector machines. The models identified new candidates for activity validation experiments in a virtual high-throughput screen of the 72 million compounds in the PubChem Compound database of the National Library of Medicine. The models were then retrained to determine whether improvement was possible and what might affect the improvement resulting from retraining. After activity validation experiments, the activity and structure of candidates and compounds from the training set were compared to identify structure-activity relationships for additional avenues of inquiry.

Seven case studies were conducted to test the robustness of the pipeline in response to changing dataset size and active fraction: Cathepsin L, Factor XIIa, Factor XIa, C1s, SENP8, and PK-M2 with two different datasets. Together, the seven case studies showed that model retraining was beneficial and that the pipeline was more effective at low active fractions. Recommendations for future use are to retrain models when possible, to extrapolate incrementally, and to apply the pipeline to datasets with small active fractions while avoiding large datasets with high active fractions, in order to maximize pipeline effectiveness and utility.
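
The screen, validate, and retrain cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: a nearest-centroid scorer stands in for the PCA, genetic algorithm, and support vector machine models, plain numeric tuples stand in for Signature descriptors, and every name here (`active_fraction`, `fit_centroids`, `score`, `screen_and_retrain`, `assay`) is hypothetical. The sketch assumes the training set contains at least one active and one inactive compound.

```python
from math import dist
from statistics import fmean


def active_fraction(labels):
    """Fraction of compounds labeled active (1) in a dataset."""
    return sum(labels) / len(labels)


def fit_centroids(features, labels):
    """Stand-in model: mean feature vector of actives and of inactives.

    Assumption: this replaces the PCA/GA/SVM models named in the abstract
    purely for illustration.
    """
    def centroid(rows):
        return [fmean(column) for column in zip(*rows)]

    actives = [x for x, y in zip(features, labels) if y == 1]
    inactives = [x for x, y in zip(features, labels) if y == 0]
    return centroid(actives), centroid(inactives)


def score(model, x):
    """Higher score means x lies closer to the active centroid."""
    active_centroid, inactive_centroid = model
    return dist(x, inactive_centroid) - dist(x, active_centroid)


def screen_and_retrain(features, labels, library, assay, rounds=2, batch=2):
    """Rank the library, test the top candidates, then retrain on the results.

    `assay` is a hypothetical callable standing in for an activity
    validation experiment; it returns 1 (active) or 0 (inactive).
    """
    features, labels, library = list(features), list(labels), list(library)
    tested = []
    for _ in range(rounds):
        model = fit_centroids(features, labels)  # (re)train on all known data
        library.sort(key=lambda x: score(model, x), reverse=True)
        picks, library = library[:batch], library[batch:]
        for x in picks:
            y = assay(x)
            features.append(x)  # validated candidates join the training set
            labels.append(y)
            tested.append((x, y))
    return tested, active_fraction(labels)


# Toy data: two made-up descriptor values per compound.
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.2, 0.1)]
train_y = [0, 0, 1, 0]
library = [(0.9, 1.1), (0.0, 0.1), (1.2, 0.8), (0.3, 0.3)]
toy_assay = lambda x: int(x[0] + x[1] > 1.5)  # made-up activity rule

tested, new_fraction = screen_and_retrain(
    train_X, train_y, library, toy_assay, rounds=1, batch=2
)
print(tested, new_fraction)
```

Each round extends the training set only with candidates that were actually tested, which mirrors the recommendation to retrain when possible and to extrapolate incrementally rather than trusting a single model across the whole library.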
Keywords/Search Tags:Pipeline, Data, Discovery, Active