| The use of chemical products is closely tied to daily life. Today, more than 350,000 chemicals and chemical mixtures have been registered for production and use. These compounds may be released into the environment at any stage of their products' life cycle, where they may pose a threat to environmental safety and human health. Biodegradability is the capacity of a compound's molecules to be decomposed and utilized during the metabolic processes of organisms and thus removed from the environment. The bioconcentration factor (BCF) is the ratio of the equilibrium concentration of a chemical substance in an organism to its equilibrium concentration in the environmental medium; it measures the degree to which a substance is absorbed and accumulated from the environment. Biodegradability and the bioconcentration factor are indispensable indicators for compound risk assessment, but the large number of compounds makes it difficult to assess them all by experimental methods. It is therefore of great significance to find efficient and effective computational methods to predict biodegradability and bioconcentration factors. A large number of compounds have already been tested by traditional toxicological experiments, which provide basic and reliable structure and toxicity data for computational analysis. With the development of computing technology, mathematical modeling and other computer-aided means make it possible to carry out computational toxicology research, that is, to explore the relationship between the molecular structure and the toxicity of compounds and to make model-based predictions. The main work of this paper is as follows: (1) Qualitative classification models for the biodegradability of compounds were established. Biodegradability data for 1958 compounds were collected. Three types of descriptors (CORINA descriptors, MACCS fingerprints and ECFP4 fingerprints) were used to characterize the compounds in the dataset. Four machine learning algorithms (support vector machine, decision
tree, random forest and deep neural network) were used to construct 189 models that predict the biodegradability of compounds. Model D2, built with MACCS fingerprints and the deep neural network (DNN) algorithm, performed best (training set: prediction accuracy Q = 89.55%, Matthews correlation coefficient MCC = 0.76; test set: Q = 90.08%, MCC = 0.77). Quantitative structure-activity relationship analysis indicated that physicochemical properties such as solubility, π-bond electronegativity, π-bond charge, number of rotatable bonds, lone-pair electronegativity, effective atomic polarizability and molecular weight may be important for the biodegradability of compounds. Aromatic ring structures and nitrogen and halogen atoms in a molecule hinder biodegradation, whereas ester groups favor it. Representative substructural fragments of biodegradable and refractory compounds were identified from the difference in their frequency between the two classes. These fragments can serve as warning signals in the future risk assessment of compounds. (2) Regression models to predict the bioconcentration factors of compounds were established. Fish bioconcentration factor data for 1294 compounds were collected, the compounds in the dataset were characterized by two kinds of molecular descriptors (CORINA and RDKit), and two machine learning algorithms (support vector machine and random forest) were used to develop regression models. For the 8 models, the coefficient of determination R² on the test set was generally greater than 0.7 and the mean squared error (MSE) was below 0.6, indicating that the models predict bioconcentration factors well. The best of these, Model F4 (built with 56 RDKit descriptors and the support vector regression algorithm), achieved R² = 0.9 and MSE = 0.19 on the training set, R² = 0.79 and MSE = 0.42 on the test set, and R² = 0.73 and MSE = 0.52 on the validation set. Based on the
quantitative structure-activity relationship analysis, the feature weight of the octanol-water partition coefficient (log P) was found to be higher than that of the other features, which suggests that log P plays an important role in the bioconcentration factor of compounds. However, a model that used log P as the only descriptor to predict the BCF of compounds performed poorly, indicating that the partition coefficient alone cannot characterize the compound molecules well. The molecular diameter and surface area of compounds were also found to have important effects on bioconcentration, and a low molecular weight was found to favor bioconcentration. In this paper, the biodegradability and bioconcentration factors of compounds were studied from the perspective of computational prediction, and a series of reliable machine learning models were developed. By analyzing the prediction results and the descriptors of the optimal models, some rules relating the structure of compounds to their toxicity were explored. These models and conclusions can provide a reference for the risk assessment and data management of chemicals in the future. |
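
The abstract reports Q, MCC, R² and MSE and defines the bioconcentration factor as a concentration ratio. The following is a minimal sketch of how these quantities are conventionally computed, assuming the standard textbook definitions; the abstract does not give its exact formulas. The function names are hypothetical, and taking the base-10 logarithm of the BCF as the regression target is an assumption, not something the abstract confirms.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Prediction accuracy Q: fraction of correctly classified compounds."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def log_bcf(c_organism, c_medium):
    """log10 of the ratio of equilibrium concentrations (organism / medium).

    Assumption: both concentrations are in compatible units and are positive.
    """
    return math.log10(c_organism / c_medium)


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not data from the study.
    print(accuracy(45, 45, 5, 5), mcc(45, 45, 5, 5))
    print(r2([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]), mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
    print(log_bcf(1000.0, 1.0))
```

MCC is usually reported alongside Q because accuracy alone can look favorable when the two classes (biodegradable and refractory) are imbalanced, whereas MCC uses all four cells of the confusion matrix.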