Computer-aided drug design has been used in all stages of drug development, providing a powerful tool for drug discovery. However, its success rate is still low; candidate drugs often lack the anticipated biological activity or show serious toxic effects in subsequent experimental tests. Improving the efficiency of virtual screening and the accuracy of toxicity prediction therefore remain key issues in the field. This study aimed to develop new virtual screening and toxicity prediction tools by applying machine learning methods to large amounts of existing experimental data. On the one hand, taking influenza virus neuraminidase as the research object, new approaches for improving the efficiency of virtual screening for neuraminidase inhibitors were studied from the perspectives of both structure-based and ligand-based virtual screening. On the other hand, the most severe and most common toxic effects (carcinogenicity, mutagenicity, and hepatotoxicity) were taken as the research objects in order to develop more accurate toxicity prediction models for organic compounds. The new models developed in this study provide useful tools for the early stages of drug development and could improve its efficiency.

In structure-based virtual screening, a scoring function is used to estimate the binding affinity between a ligand and its target. The accuracy of this estimate is one of the key factors affecting the efficiency of virtual screening. The commonly used scoring functions are generic and can be applied to all drug targets. However, different drug targets have different structural properties, so a scoring function built specifically for a particular target is expected to yield higher virtual screening efficiency. Therefore, we developed an influenza virus neuraminidase-specific scoring function (RF-NA-Score) using the random forest algorithm. In 5-fold
cross-validation, the Pearson correlation coefficient and the root-mean-square error between the binding affinity values predicted by RF-NA-Score and the experimental values were 0.707 and 1.46, respectively, which was better than the performance of RF-Score (a generic scoring function developed using the random forest algorithm). Further analysis showed that rescoring molecular docking results with RF-NA-Score significantly improves the efficiency of virtual screening. A virtual screening strategy using RF-NA-Score as the rescoring function was then applied to screen the SPECS database for neuraminidase (NA) inhibitors, and two compounds with novel scaffolds showed inhibitory activity. These results indicate that RF-NA-Score improves the efficiency of virtual screening for NA inhibitors and can be used successfully to identify NA inhibitors with new scaffolds.

In ligand-based virtual screening, quantitative structure-activity relationship (QSAR) models, which link the structural features of compounds to their biological activity, are usually established to predict the biological activity of new compounds. The QSAR models developed so far for influenza virus neuraminidase inhibitors do not consider the differences between neuraminidase subtypes. Because the structure of the catalytic center differs between subtypes, the structural features of their inhibitors may differ as well. It is therefore necessary to establish a QSAR model for a specific class of neuraminidase inhibitors to improve the efficiency of virtual screening. In addition, ensemble learning can fuse a series of models established with different methods into a single ensemble model, which usually shows higher performance than the individual models. Therefore, a new QSAR model for group 2 neuraminidase inhibitors was developed using ensemble learning. The best-performing ensemble model was Ensemble-Top12 (a fusion of 12 base classifiers), giving an AUC (area under the receiver operating characteristic curve; the value of AUC is
between 0 and 1, and the larger the value, the stronger the classification ability of the model) of 0.976 and an accuracy of 90.7% in 5-fold cross-validation. For comparison, QSAR models that do not distinguish between neuraminidase subtypes were also developed; the best-performing of these was RF-RFE, with an AUC of 0.942 and an accuracy of 87.0%. The QSAR model developed specifically for group 2 neuraminidase inhibitors is thus more accurate.

A variety of tools have been developed to predict the toxicity of organic compounds, but their accuracy is still low. In this study, ensemble learning was used to develop more accurate models for predicting the carcinogenicity, mutagenicity, and hepatotoxicity of organic compounds.

Ensemble models for predicting carcinogenicity were established using 1003 compounds with rat carcinogenicity data from the CPDB database as the training set, and 40 organic compounds from the ISSCAN database that did not overlap with the training set as the external test set. The model named Ensemble XGBoost performed best, giving an AUC of 0.765 and an accuracy of 70.1% in 5-fold cross-validation and an AUC of 0.765 and an accuracy of 70.0% in external validation. Its AUC and accuracy are higher than those of the 36 machine learning models built on the same training set, indicating that ensemble learning can improve the performance of carcinogenicity prediction models. Ensemble XGBoost also achieved high predictive performance in comparison with carcinogenicity prediction methods reported in recent years.

The Ames mutagenicity benchmark dataset, containing 6305 organic compounds, was used as the training set to develop ensemble models for mutagenicity prediction, and 1178 organic compounds obtained from the CCRIS, NTP, and ISSTY databases were used as the external test set. The ensemble model named Ensemble-Top17 performed best, giving an AUC of 0.899
and an accuracy of 82.7% in 5-fold cross-validation and an AUC of 0.894 and an accuracy of 82.1% in external validation. Ensemble-Top17 predicts the mutagenicity of compounds more accurately than the models reported in recently published literature.

The ensemble models for predicting hepatotoxicity were trained on 1241 organic compounds collected from the literature, and the 286 compounds in the LTKB-BD database were used as the external test set. The best-performing ensemble model was Ensemble-Top6, which achieved an AUC of 0.763 and an accuracy of 70.9% in 5-fold cross-validation and an AUC of 0.765 and an accuracy of 86.4% in external validation. Both its AUC and accuracy are higher than those of the hepatotoxicity prediction model reported in the literature.

To facilitate the use of the ensemble models, we established web servers named CarcinoPred-EL, MutagenPred-EL, and LiverToxPred-EL for the carcinogenicity, mutagenicity, and hepatotoxicity prediction models, respectively.

In summary, this study makes the following innovative contributions: (1) a new influenza virus neuraminidase-specific scoring function was developed, together with a more effective virtual screening strategy based on it; (2) a QSAR model for group 2 neuraminidase inhibitors was established, resulting in higher predictive power; and (3) more efficient ensemble models were developed for predicting the carcinogenicity, mutagenicity, and hepatotoxicity of organic compounds.
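
The Top-N ensembles named above (Ensemble-Top12, Ensemble-Top17, Ensemble-Top6) and the two metrics used to report them can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the abstract: base classifiers are ranked by AUC, the fused score is the plain average of the selected classifiers' predicted probabilities, and the scores passed in are out-of-fold predictions from cross-validation. The function names, the classifier names `clf_a` to `clf_c`, and the toy numbers are all hypothetical and do not reproduce any result reported here.

```python
from statistics import mean


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of compounds classified correctly at the given probability threshold."""
    hits = [1.0 if (s >= threshold) == (y == 1) else 0.0 for y, s in zip(labels, scores)]
    return mean(hits)


def top_n_ensemble(labels, base_scores, n):
    """Rank base classifiers by AUC and average the scores of the best n (assumed fusion rule)."""
    ranked = sorted(base_scores, key=lambda name: auc(labels, base_scores[name]), reverse=True)
    chosen = ranked[:n]
    fused = [mean(column) for column in zip(*(base_scores[name] for name in chosen))]
    return chosen, fused


# Hypothetical toy data: 1 = active/toxic, 0 = inactive/non-toxic.
labels = [1, 1, 1, 0, 0, 0]
base_scores = {
    "clf_a": [0.9, 0.4, 0.8, 0.6, 0.2, 0.1],
    "clf_b": [0.6, 0.8, 0.3, 0.2, 0.5, 0.4],
    "clf_c": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # uninformative classifier
}

chosen, fused = top_n_ensemble(labels, base_scores, n=2)
print("selected:", chosen)
print("ensemble AUC:", auc(labels, fused))
print("ensemble accuracy:", accuracy(labels, fused))
```

In this toy case each selected classifier misranks a different compound, so averaging their scores corrects both errors. That is the intuition behind fusing models built with different methods, although real gains depend on how diverse and how accurate the base classifiers are.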