| Drug-induced hepatotoxicity is a major cause of failure in new drug research and of the withdrawal of drugs from clinical use. According to statistics, hepatotoxicity accounts for 37% of drug candidate failures during new drug development, and drug-induced hepatotoxicity accounts for 18% of drug withdrawals in clinical application. Predicting drug-induced hepatotoxicity in early drug development and in clinical use is therefore of great significance for improving the research success rate and for rational drug use. Because the mechanism of drug-induced hepatotoxicity is complex, improving the accuracy and applicability of its prediction is challenging, particularly for late-onset hepatotoxicity. The present study therefore attempts to construct a drug-induced hepatotoxicity prediction model that combines gene expression data with machine learning methods, in order to predict various forms of hepatotoxicity and to predict late-onset hepatotoxicity early. 1. Review: This chapter reviews the current state of drug hepatotoxicity and its prediction. First, we introduce the concept of drug hepatotoxicity and explain the importance of predicting compound hepatotoxicity during drug development and application. Second, we give a general overview of the methods for predicting drug hepatotoxicity (in vitro and in vivo biological assays, expert systems, compound-based prediction methods, machine learning prediction methods, and prediction methods based on gene expression data), which provides theoretical support for this study. 2. Data collection and processing for the hepatotoxicity prediction model: The purpose of this chapter is to collect gene expression data for hepatotoxic drugs (or compounds) and then to preprocess and group the data and screen for characteristic genes. A total of 988 gene expression samples for 87 compounds at different time points and
different doses were collected from the TG-GATE database and the ArrayExpress database (492 control samples and 496 drug administration samples). To enable the model to predict gene expression data from different sources (conditions) effectively and to predict late-onset hepatotoxicity in time, samples were labeled according to the degree of biological effect (hepatotoxicity) associated with them: all gene expression samples, at all time points and all doses, for compounds with moderate or greater hepatotoxicity were used as positive samples, while the control samples and the gene expression samples for compounds with slight hepatotoxicity were used as negative samples. From these, 123 positive samples and 121 negative samples were randomly selected as the training set and 26 positive samples and 24 negative samples as the test set; the remaining samples were used for feature gene screening. After null and invalid values in the gene chip data were filled by nearest-neighbor imputation and the data were standardized, characteristic genes were screened by combining differential gene expression analysis with the Boruta algorithm. First, differential gene expression analysis was performed on the positive samples reserved for feature screening using the Bioconductor package in R, which yielded 375 differentially expressed genes. Second, to reduce dimensionality, these differentially expressed genes were screened further with the Boruta algorithm, a wrapper around the random forest, which yielded 78 feature genes for model building. 3. Construction, optimization and performance testing of the SVM hepatotoxicity prediction model: In this study, an initial model was built with libsvm on the training set, and the basic parameters for model construction were screened by varying one parameter at a time (the single-variable principle), with cross-validation accuracy as the index. The results show that when the cross-validation mode is 7-fold
cross-validation, the model type is nu-SVC, the kernel function is the RBF kernel, and the remaining parameters are left at their default values, the constructed model obtains the best prediction performance: the cross-validation accuracy on the training set is 90.8163%; on the test set, the sensitivity (SE) is 57.6923%, the specificity (SP) is 100%, the accuracy (ACC) is 78.00%, and the Matthews correlation coefficient (MCC) is 62.8971%. On the basis of these settings, three optimization models based on GA (genetic algorithm), GS (grid search algorithm) and PSO (particle swarm optimization algorithm) were constructed to further optimize the penalty parameter c and the kernel function parameter g. Comparing the three optimized models in terms of the best parameters obtained, the cross-validation accuracy on the training set and the performance indicators on the test set shows that the PSO algorithm gives the best prediction performance (optimal penalty parameter c of 0.88064 and kernel function parameter g of 0.1): the cross-validation accuracy on the training set is 89.7959%, and on the test set SP is 100%, SE is 73.0769%, ACC is 86.00% and MCC is 75.2168%. The best prediction model is therefore determined to use 7-fold cross-validation, the nu-SVC model type, the RBF kernel function, a penalty parameter c of 0.88064 and a kernel function parameter g of 0.1. 4. Literature and experimental validation of the optimal SVM hepatotoxicity prediction model: To further investigate the predictive performance of the best prediction model, this study used two sets of reported gene expression data, from sources different from the data used in model construction, as external tests to validate the model. One group consisted of 46 gene expression data samples
corresponding to 7 compounds with known hepatotoxicity; the other group consisted of 6 gene expression data samples corresponding to 5 consecutive days of administration of Vinblastine, a compound whose hepatotoxicity was unknown. All 46 samples corresponding to the known hepatotoxic compounds were predicted to be positive for hepatotoxicity, which is consistent with the literature reports. The six Vinblastine samples were also all predicted to be positive. To verify these predictions, we investigated whether hepatotoxicity occurred in Sprague Dawley rats administered Vinblastine. Serum enzyme assays for ALT and AST and H-E stained pathological sections, obtained under the same dosing and feeding conditions as the predicted gene expression data, showed that hepatotoxicity did not occur in the rats on days 1 to 5 but appeared on day 9 after dosing. These results show that the predictive model can be applied to gene expression data from different sources and can predict late-onset hepatotoxicity at an early stage with high predictive performance. In summary, this study successfully constructed a hepatotoxicity prediction model based on gene expression data and machine learning. The model effectively predicts gene expression data from different sources, has good applicability, predicts late-onset liver toxicity in advance with high prediction performance, and provides a reference for the prediction of drug-induced liver toxicity. |