Environmental persistence is an important factor determining the level of environmental exposure to chemicals, so screening for persistent chemicals is of great importance for chemical risk management. However, obtaining environmental persistence parameters experimentally is inefficient, time-consuming, and expensive, so efficient (high-throughput and low-cost) prediction techniques are needed. Computational techniques based on quantitative structure-activity relationships (QSAR) can predict the environmental persistence parameters of chemicals by establishing correlations between their molecular structures and their environmental behavior parameters. In this study, molecular structure descriptors were combined with various machine learning algorithms to develop QSAR models for predicting the ready biodegradability of chemicals and their degradation half-lives (t1/2) in four environmental media (air, water, soil, and sediment). The main research contents are as follows:

(1) QSAR models were developed to predict the ready biodegradability of chemicals. Ready biodegradability data for 2043 organic chemicals were collected from the literature and open-source software and compiled into a database. A total of 72 individual models were developed by combining 12 molecular fingerprints with 6 machine learning algorithms. The molecular fingerprints were computed using the PaDEL-Descriptor software, and the algorithms were k-nearest neighbors, logistic regression, Bernoulli naive Bayes, decision tree, random forest, and support vector machine. Ten-fold cross-validation and external validation were used to evaluate the robustness and generalization ability of the models. Using the molecular fingerprints and algorithms that performed well in the individual models, 28 ensemble models were further developed, and the applicability domain of the optimal ensemble model was characterized based on molecular similarity. The results showed that the ensemble model had better goodness of fit, robustness, and generalization ability than the individual models, and that setting a suitable applicability domain significantly improved the generalization ability of the model.

(2) QSAR models were developed to predict the t1/2 of chemicals in the four media. The t1/2 data for 250 organic chemicals in the four media were collected from the literature and physicochemical property handbooks. Single-task neural network models for predicting t1/2 in each medium were developed using Mordred descriptors and 12 molecular fingerprints combined with a multilayer feedforward neural network algorithm. Two types of multi-task neural network models that simultaneously predict t1/2(air), t1/2(water), t1/2(soil), and t1/2(sediment) were developed with two different input modes: single-input multi-task (SIMO-MT) models and multi-input multi-task (MIMO-MT) models. The results showed that the multi-task models outperformed the single-task models. This can be attributed to multi-task learning capturing the associations between tasks during model construction and sharing this information during training, which improved the prediction for each task.

The models developed in this study can efficiently screen environmentally persistent chemicals and provide technical support for chemical risk assessment.
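
The similarity-based application (applicability) domain and the ensemble step described in part (1) can be illustrated with a minimal sketch. It is not the implementation used in this study: the Tanimoto coefficient as the similarity measure, the threshold of 0.6, the choice of k, the majority-vote rule, and all function names are assumptions made for illustration only. Fingerprints are represented as sets of "on" bit indices.

```python
from typing import FrozenSet, Sequence

# A binary fingerprint, stored as the indices of its "on" bits (illustrative representation).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / union


def in_domain(query: Fingerprint,
              training: Sequence[Fingerprint],
              threshold: float = 0.6,
              k: int = 1) -> bool:
    """Return True if the mean similarity between the query and its k most
    similar training compounds reaches the threshold (assumed criterion)."""
    sims = sorted((tanimoto(query, t) for t in training), reverse=True)
    nearest = sims[:k]
    return bool(nearest) and sum(nearest) / len(nearest) >= threshold


def majority_vote(votes: Sequence[int]) -> int:
    """Combine binary predictions (1 = readily biodegradable) from several
    individual models; a tie is resolved as 0 (assumed rule)."""
    return 1 if 2 * sum(votes) > len(votes) else 0


# Toy data, not taken from the study.
training_set = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5}), frozenset({7, 8, 9})]
query_near = frozenset({1, 2, 3})   # shares 3 of 4 bits with the first compound
query_far = frozenset({10, 11})     # shares no bits with any training compound

print(tanimoto(query_near, training_set[0]))  # 0.75
print(in_domain(query_near, training_set))    # True
print(in_domain(query_far, training_set))     # False
print(majority_vote([1, 1, 0]))               # 1
```

In this sketch, a prediction from the ensemble would only be reported for compounds for which `in_domain` returns True. Raising the threshold narrows the domain, which typically trades coverage for reliability.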