
Reproductive Toxicity QSXR Models Constructed By Machine Learning And Substructure-Based Graph Convolutional Neural Networks

Posted on: 2024-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: Q Chen
GTID: 2544307163477824
Subject: Pharmaceutical
Abstract/Summary:
Reproductive toxicity refers to the harmful effects of drugs and other chemicals on mammalian reproductive function and offspring development. It is an important aspect of the non-clinical safety evaluation of drugs and a key basis for chemical risk management. With the development of artificial intelligence in pharmacology, accurate toxicity prediction during drug design can reduce the probability that a candidate drug turns out to be toxic and improve the efficiency of drug development. Organic molecules are naturally represented as graphs because of their structural composition, so graph convolutional neural networks (GCN) have been widely used to predict the physical and chemical properties, biological activity, and toxicity of organic compounds. However, a conventional GCN takes the atoms of a molecule as the nodes of the graph, which hampers subsequent model interpretation and the extraction of structural alerts: when the contribution to toxicity is analyzed at the fragment level, the resulting structures are often scattered fragments rather than complete substructures (such as a benzene ring, carboxyl group, or indole ring). To address this, this thesis establishes two kinds of reproductive toxicity prediction models for organic compounds: machine learning models based on molecular descriptors and molecular fingerprints, and a substructure-based graph convolutional neural network model, SUS-GCN. From a survey of the relevant literature and databases, 1973 compounds with reproductive toxicity classification labels were collected as the dataset. All models were evaluated using five-fold cross-validation and external test set validation, with the main evaluation indicators being AUC, accuracy (ACC), specificity (SPE), and sensitivity (SEN).
1. Machine learning models based on molecular descriptors and molecular fingerprints. Four machine learning algorithms, namely support vector machine (SVM), k-nearest neighbor (KNN), random forest (RF), and extreme gradient boosting (XGBoost), were used to build models based on molecular descriptors and on molecular fingerprints. After calculation and screening with Mordred, 678 molecular descriptors were retained as the feature input, and combining them with the four algorithms gave four descriptor-based models. Three molecular fingerprints, namely MACCS, Morgan, and RDKFingerprint, were used, and combining them with the four algorithms gave 12 fingerprint-based models. After training, the performance of the models was evaluated. In five-fold cross-validation, the MACCS + XGBoost model had the highest AUC (0.915), RDKFingerprint + XGBoost had an ACC of 0.836, and MACCS + XGBoost had a SEN of 0.812. On the external test set, the MACCS + XGBoost model had the highest AUC (0.907), and the ACC of all models ranged from 0.72 to 0.86. The SPE of Morgan + KNN reached 0.972.
2. Substructure-based graph convolutional neural network model SUS-GCN. Building on the principle of graph convolutional neural networks, the conventional practice of treating atoms as graph nodes was changed: substructures of the organic compound are used as nodes, and the covalent bonds connecting substructures are used as edges. A substructure library of 156 substructures, including benzene rings, indoles, pyridines, and carboxyl groups, was established. The feature of a node is the feature of the corresponding substructure, and the feature of an edge is the feature of the covalent bond connecting the substructures. Through the adjacency matrix, the feature information of the molecule is transformed into the feature information of the graph. The SUS-GCN model consists of convolutional layers and multilayer perceptron layers, uses the Adam optimizer, and introduces an early stopping mechanism to suppress overfitting. The effects of hyperparameters such as the number of hidden layers, the hidden layer size, the learning rate, and the dropout rate on the model results were explored. In five-fold cross-validation, the AUC of the models ranged from 0.72 to 0.86, the ACC from 0.70 to 0.80, the SPE from 0.77 to 0.87, and the SEN from 0.60 to 0.74. On the external test set, SUS-GCN13 had the highest ACC (0.826), and the ACC of all models ranged from 0.79 to 0.83. The AUC of SUS-GCN13 was 0.866, and its SEN was 0.814.
In conclusion, both the machine learning models based on molecular descriptors and molecular fingerprints and the substructure-based graph convolutional neural network model SUS-GCN perform well in predicting the reproductive toxicity of organic compounds. The SUS-GCN model additionally offers good interpretability and the ability to extract structural alerts, which can provide a reference for the design and synthesis of new drugs.
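The four evaluation indicators named in the abstract (AUC, ACC, SPE, SEN) can be illustrated with a short sketch. The code below is not taken from the thesis: the function name `classification_metrics`, the 0.5 decision threshold, and the toy labels and scores are assumptions chosen only for illustration. AUC is computed as the fraction of (toxic, non-toxic) pairs in which the toxic compound receives the higher score, with ties counted as one half.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Return ACC, SEN, SPE and AUC for binary labels (1 = toxic, 0 = non-toxic).

    The threshold of 0.5 is an assumption for this sketch.
    Both classes must be present, otherwise SEN, SPE or AUC is undefined.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # AUC as the probability that a random positive outranks a random negative.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)

    return {
        "ACC": (tp + tn) / len(y_true),   # accuracy
        "SEN": tp / (tp + fn),            # sensitivity (true positive rate)
        "SPE": tn / (tn + fp),            # specificity (true negative rate)
        "AUC": wins / (len(pos) * len(neg)),
    }


# Toy data, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
metrics = classification_metrics(y_true, y_score)
print(metrics)
```

In a five-fold cross-validation setting, a function like this would be applied to the held-out fold of each split and the values summarized across folds.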
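The idea of using substructures rather than atoms as graph nodes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SUS-GCN implementation: the four-entry `LIBRARY` stands in for the 156-substructure library, node features are plain one-hot vectors, edge (bond) features are ignored, the weights `w` are arbitrary fixed numbers, and the propagation rule is the common symmetric normalization `D^-1/2 (A + I) D^-1/2 H W` followed by ReLU, which the abstract does not specify. The example molecule is a hypothetical one made of a pyridine ring bonded to a benzene ring that carries a carboxyl group.

```python
import math

# Toy stand-in for the substructure library (assumption: the real one has 156 entries).
LIBRARY = ["benzene", "indole", "pyridine", "carboxyl"]


def one_hot(name):
    """One-hot node feature for a substructure (assumed feature scheme)."""
    return [1.0 if name == entry else 0.0 for entry in LIBRARY]


def adjacency(n_nodes, edges):
    """Symmetric adjacency matrix; each edge is a bond linking two substructures."""
    a = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    return a


def normalize(a):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(a)
    a_hat = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_norm, h, w):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Hypothetical molecule: pyridine - benzene - carboxyl.
nodes = ["pyridine", "benzene", "carboxyl"]
edges = [(0, 1), (1, 2)]

h0 = [one_hot(name) for name in nodes]           # 3 nodes x 4 features
a_norm = normalize(adjacency(len(nodes), edges))
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6], [0.2, 0.2]]  # arbitrary 4 x 2 weights

h1 = gcn_layer(a_norm, h0, w)                    # 3 nodes x 2 hidden features
graph_embedding = [sum(col) / len(h1) for col in zip(*h1)]  # mean-pooling readout
print(h1, graph_embedding)
```

Because every node is a whole substructure, any per-node importance score derived from such a model refers to a complete ring or functional group rather than to isolated atoms, which is the interpretability benefit the abstract describes. In a full model the pooled vector would be passed to multilayer perceptron layers for classification.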
Keywords/Search Tags: reproductive toxicity, machine learning, graph convolutional neural networks, substructure, chemical fingerprint