Peroxisome proliferator-activated receptor γ (PPARγ) is a nuclear receptor protein that participates in essential physiological functions such as the regulation of carbohydrate and lipid metabolism. Chemicals that abnormally bind to or activate PPARγ can therefore disrupt these functions, which constitutes a molecular initiating event leading to macroscopic adverse health effects. Computational toxicology models capable of predicting the activity categories or continuous activity values of chemicals could fill the PPARγ activity data gap efficiently and inexpensively, thereby supporting the hazard management of chemicals. This study constructed a PPARγ activity database, established quantitative structure-activity relationship (QSAR) models with machine learning algorithms, analyzed the structure-activity landscapes (SAL) of the chemicals in the data sets, and created applicability domain characterization methods for the models. The main contents and results are as follows:

(1) A database of PPARγ activity categories and screening models for the PPARγ activity of chemicals were established, and applicability domain characterization methods for the models were created. Based on the ChEMBL database and the scientific literature in the fields of environment, health and toxicology, a training set from sources other than US Tox21/ToxCast was curated, containing 1767 chemicals (active: 1363; inactive: 404) with PPARγ agonist activity as the endpoint. The Tox21 data were adopted as the external validation set. Screening models (classifiers) were established with the random forest (RF) algorithm, using Mordred descriptors, extended connectivity fingerprints (ECFP) and molecular access system (MACCS) fingerprints. The cross-validation accuracy of the RF classifier based on Mordred descriptors was 0.91, higher than that of existing classifiers based on Tox21/ToxCast data. Descriptor space-based applicability domain characterization methods were found to be ineffective. Based on molecular
fingerprint (FP) similarity coefficient thresholds (Scutoff) and the minimum number (Nmin) of training-set chemicals judged to be structurally similar to the query chemical, a novel applicability domain characterization method, denoted ADFP{Scutoff, Nmin}, was developed; it could effectively identify chemicals that the models classify correctly.

(2) Data sets and quantitative prediction models for the continuous PPARγ activity values of chemicals were established. By analyzing the intrinsic relations among PPARγ activity endpoints, training sets TS-1 [negative logarithms of inhibition constants/dissociation constants (pKi, pKd); 419 chemicals] and TS-2 [negative logarithms of 50% inhibition concentrations (pIC50); 1316 chemicals], together with the corresponding external validation sets, were curated with PPARγ binding affinity as the endpoint. Quantitative prediction models (regressors) for the PPARγ activity values of chemicals were established using Mordred descriptors, ECFP or MACCS fingerprints, with ridge regression, least absolute shrinkage and selection operator (LASSO) regression, partial least squares regression, RF and support vector machine (SVM) algorithms, and with descriptor selection strategies based on LASSO or RF models. The combination of LASSO-selected 4096 (8192)-bit ECFP10 and SVM achieved a cross-validation coefficient of determination of 0.80 (0.81) on TS-1 (TS-2), better than the majority of existing regressors, while the number of training-set chemicals was 3 to 10 (6 to 73) times that of existing regressors. Compared with descriptor space-based applicability domain characterization methods, ADFP{Scutoff, Nmin} was found to be capable of effectively identifying chemicals with lower prediction errors.

(3) The prediction errors of the models for the activity of chemicals were quantitatively explained, and the ability of the model applicability domain to identify chemicals with low prediction errors was improved. The “network-like similarity graph” was introduced from medicinal chemistry to characterize and visualize the SAL of
QSAR training-set chemicals. Local discontinuity (LDw) based on similarity weighting functions (w) and signed LDw (LDw(±)) were developed to quantitatively characterize the local SAL properties of chemicals in the data set. Strong and significant linear relationships (Pearson correlation coefficient r > 0.8, p < 0.001) were found between LDw(±) and the prediction errors of the regressors and classifiers, which deepened the mechanistic understanding of the relationship between the SAL characteristics of chemicals in the data sets and the prediction accuracy of the associated QSAR models. Two applicability domain metrics based on LDw, i.e., similarity density (ρs) and local SAL “ruggedness” (V), were created. The Nmin of ADFP{Scutoff, Nmin} was found to be a special case of ρs. Based on the lower boundary of ρs (ρs,LB) and the upper boundary of V (VUB), a new applicability domain characterization method, ADFP{w, ρs,LB, VUB}, was developed, which identified chemicals with low prediction errors even better than ADFP{Scutoff, Nmin} (the mean absolute error dropped from 0.46 to 0.27).

In summary, databases with the endpoints of PPARγ agonist activity and binding affinity of chemicals were established, providing the data basis for developing computational toxicology models of the PPARγ activity of chemicals. Screening and quantitative prediction models of the PPARγ activity of chemicals were developed with machine learning algorithms. Based on similarity network analysis of molecular structures and SAL analysis, a correlation between SAL local discontinuity and the prediction errors of QSAR models was revealed. Applicability domain characterization methods oriented toward QSAR models based on high-dimensional descriptors or molecular fingerprints were developed.
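
The ADFP{Scutoff, Nmin} rule described in part (1) can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the similarity coefficient is taken to be the Tanimoto coefficient, fingerprints are represented as sets of on-bit indices, and "similar" is read as similarity greater than or equal to Scutoff. The function names `tanimoto` and `in_domain` and the toy fingerprints are hypothetical and are not taken from the study.

```python
from typing import FrozenSet, Sequence

# A fingerprint is modelled as the set of indices of its "on" bits (assumption).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: |a AND b| / |a OR b| (assumed similarity measure)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def in_domain(query: Fingerprint,
              training: Sequence[Fingerprint],
              s_cutoff: float,
              n_min: int) -> bool:
    """Return True if at least n_min training chemicals have a similarity
    to the query of s_cutoff or more (assumed reading of the rule)."""
    n_similar = sum(1 for fp in training if tanimoto(query, fp) >= s_cutoff)
    return n_similar >= n_min


# Toy data, purely illustrative.
training_set = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({7, 8, 9}),
]
query_close = frozenset({1, 2, 3, 4})   # similarities: 1.0, 0.6, 0.0
query_far = frozenset({10, 11})         # similarities: 0.0, 0.0, 0.0

print(in_domain(query_close, training_set, s_cutoff=0.5, n_min=2))
print(in_domain(query_far, training_set, s_cutoff=0.5, n_min=1))
```

In this reading, raising Scutoff or Nmin shrinks the applicability domain: fewer query chemicals qualify, but those that do have more close structural neighbours in the training set.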