
Study On Screening Models Of Carcinogenic Chemicals Constructed With Machine Learning

Posted on: 2023-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: C Wu
Full Text: PDF
GTID: 2531306830479664
Subject: Environmental engineering
Abstract/Summary:
Chemicals have markedly improved our quality of life. However, exposure to hazardous chemicals poses risks to human health and ecosystems, so screening and controlling carcinogenic chemicals is of great significance for protecting human health. Animal-based test methods for screening carcinogenic chemicals are low-throughput, time-consuming and even unethical, which makes it difficult for them to meet the demands of chemical risk management. Quantitative structure-activity relationship (QSAR) models, among the most important methods in computational toxicology, can overcome these weaknesses of experimental methods and provide necessary tools for chemical risk management. In this study, multiple machine learning algorithms were used to construct QSAR models for screening carcinogenic chemicals. The main research work is as follows:

(1) Individual and ensemble models for screening carcinogenic chemicals were constructed using different machine learning algorithms: support vector machines, k-nearest neighbors, random forests, gradient boosted decision trees and artificial neural networks. A database containing 2,107 chemicals was developed through literature research, of which 1,152 were labeled as carcinogenic. Thirteen types of molecular features, including CDK fingerprints, PubChem fingerprints and molecular descriptors, were calculated and used to construct 65 individual models. The models were evaluated by parameters such as accuracy (RA), specificity (RSP) and area under the receiver operating characteristic curve (AROC). The models with RSP values greater than 60% were then used to build 52 ensemble models with a soft voting strategy. The applicability domain of the models was characterized based on compound structural similarity. The results showed that the ensemble models perform better than the individual models. The RA and AROC values of the optimal model on the training set are 92.7% and 94.5%, respectively; on the external validation set they are 79.2% and 81.6%, respectively. More than 20,000 chemicals in the Inventory of Existing Chemical Substances in China were screened with the optimal model. A total of 2,052 chemicals fell within the applicability domain of the models, of which 687 were predicted to be carcinogenic.

(2) Screening models for carcinogenic chemicals were constructed based on the graph neural network (GNN) algorithm. The inputs of the GNN models are molecular graphs, in which nodes and edges represent atoms and bonds, respectively. The batch size and the number of neurons in the atom-level hidden layer were optimized. Other parameters were left at their default values; for instance, the number of neurons in the graph-level hidden layer is 256 and the number of epochs is 1,000. A total of 6 GNN models were constructed. The RA and AROC values of the optimal GNN model on the training set are 94.1% and 99.5%, respectively; on the external validation set they are 80.1% and 89.1%, respectively. The applicability domain of the models was characterized based on compound structural similarity, and the model was interpreted using the attention mechanism of the GNN.

After applicability domain characterization, the RA and AROC values of the ensemble model on the external validation set are better than those of previous models, and the AROC of the optimal GNN model on the external validation set is also better than that of previous models. The established models can be used to screen chemicals for carcinogenicity and provide a basis for chemical risk management.
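The ensemble workflow described in part (1) can be illustrated with a minimal sketch. It is not the thesis implementation. The specificity filter (above 60%) and the soft voting step come from the abstract. The remaining choices are assumptions made for illustration: the 0.5 classification cut-off, the Tanimoto coefficient as the structural similarity measure, the 0.3 similarity threshold, the toy probabilities and all function names. Fingerprints are represented here as sets of "on" bit indices.

```python
from typing import Dict, List, Sequence, Set


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """TN / (TN + FP), where label 0 means non-carcinogenic."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tn / (tn + fp) if (tn + fp) else 0.0


def select_models(y_true: Sequence[int],
                  model_probs: Dict[str, List[float]],
                  min_spec: float = 0.60) -> List[str]:
    """Keep the models whose specificity exceeds min_spec (60% in the abstract)."""
    kept = []
    for name, probs in model_probs.items():
        preds = [1 if p >= 0.5 else 0 for p in probs]  # 0.5 cut-off is an assumption
        if specificity(y_true, preds) > min_spec:
            kept.append(name)
    return kept


def soft_vote(prob_lists: Sequence[Sequence[float]]) -> List[float]:
    """Soft voting: average the carcinogenic-class probability across models."""
    n_models = len(prob_lists)
    return [sum(col) / n_models for col in zip(*prob_lists)]


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def in_domain(query: Set[int], training: Sequence[Set[int]],
              threshold: float = 0.3) -> bool:
    """Assumed rule: inside the domain if the nearest training compound
    is at least `threshold` similar to the query compound."""
    return max((tanimoto(query, fp) for fp in training), default=0.0) >= threshold


# Toy data (hypothetical): 4 chemicals, 3 individual models.
Y_TRUE = [1, 0, 1, 0]
MODEL_PROBS = {
    "svm": [0.9, 0.2, 0.6, 0.4],
    "knn": [0.7, 0.6, 0.8, 0.7],  # labels every chemical positive, so it is filtered out
    "rf": [0.8, 0.1, 0.4, 0.3],
}

kept = select_models(Y_TRUE, MODEL_PROBS)
ensemble = soft_vote([MODEL_PROBS[name] for name in kept])
print(kept, ensemble)
```

In this sketch a model that labels every chemical as carcinogenic is excluded by the specificity filter before voting, which is presumably the purpose of the RSP criterion: it keeps models with a high false-positive rate out of the ensemble.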
Keywords/Search Tags: Chemicals, Carcinogenicity, Machine learning, Graph neural networks