Metal-organic frameworks (MOFs) have shown great potential in industrial adsorption separation, catalysis, biomedicine, electronic devices, and other fields because their properties are highly tunable. A large number of MOFs have been synthesized experimentally or constructed computationally, giving rise to the CoRE-MOF (Computation-Ready Experimental MOF) database and the hMOF (hypothetical MOF) database, respectively. Traditional experimental verification cannot keep up with screening at this scale, so high-throughput screening based on molecular simulation and machine learning (ML) has been introduced. Because ML handles high-dimensional data efficiently and accurately, ML-based high-throughput screening has been widely adopted. However, several issues remain unresolved: selecting the optimal algorithm for MOF systems, improving algorithms so that they suit MOF systems, interpreting the results of ML algorithms, and predicting unknown new systems. This work uses ML-based high-throughput screening of CoRE-MOFs and hMOFs to identify the optimal ML algorithm for the adsorption performance of six systems, and to address interpretability and algorithm optimization. The specific research work is as follows:

(1) Because ML models are black boxes, it is worthwhile to investigate how the feature descriptors, namely void fraction (φ), largest cavity diameter (LCD), volumetric surface area (VSA), heat of adsorption (Qst), density (ρ), and Henry's coefficient (K), interact with the performance descriptor, the adsorption amount (N), during ML-based screening of MOFs. In this work, we first used ML models to learn the adsorption performance of eleven gases, including methane (CH4, C1), ethane (C2H6, C2), and propane (C3H8, C3) in natural gas, and propane (C3H8, C3), butane (C4H10, C4), pentane (C5H12, C5), hexane (C6H14, C6), xylene, 2-chloroethyl ethyl sulfide (C4H9ClS, 2-CEES), ethanol (C2H5OH, EtOH), and formaldehyde (HCHO) in air. We then selected the best-performing ML algorithm, extreme gradient boosting (XGBoost), to build an interpretable
model called XGBoost-SHAP by combining it with the SHAP method. The dependencies between the descriptors and adsorption performance in MOF systems were explained both locally and globally, and the optimal descriptor intervals were then identified, providing experience and guidance for synthesizing optimal MOFs.

(2) ML-based high-throughput screening is widely applicable to the selection of high-performance MOFs, but a single ML model may introduce uncertainty into the screening. In this study, we explored an ML ensemble approach to improve the accuracy of MOF screening. In the first layer, an anomaly detection step based on the isolation forest algorithm was constructed. In the second and third layers, three stacking ensemble algorithms, named stacking1, stacking2, and stacking3, were constructed using the optimal ML algorithms selected in the first layer, including Random Forest (RF), Gradient Boosting Decision Tree (GBDT), XGBoost, Decision Tree (DT), TPOT, and Linear Regression. The results showed that the three-layer ensemble model Isolation Forest-stacking1 achieved the highest prediction accuracy in the screening of CoRE-MOFs, while the three-layer ensemble model Isolation Forest-stacking3 achieved the highest prediction accuracy in the screening of hMOFs. This provides a new ML algorithm for selecting optimal MOFs through ML-based high-throughput screening.

(3) The scale and quality of the dataset are prerequisites for selecting optimal MOFs through ML-based high-throughput screening, which limits the selection of optimal MOFs in unknown new domains and on small datasets. To leverage the knowledge that ML learns from known datasets, we developed an inductive transfer learning (TL) method. The XGBoost algorithm, which showed the best predictive performance among all the ML algorithms evaluated in the first part, was used to construct the transfer learning model. First, knowledge was learned from the
hydrophobic MOFs for the adsorption of C3-C5 VOCs in air, used as the source-domain dataset, and was then transferred to C6. The transfer learning prediction achieved an R² of 0.9373, only 0.0173 away from that of the direct model, and the MSE, MAE, and RMSE were all below 0.02, indicating that the transfer learning prediction is reliable. The method can also mitigate the loss of predictive performance that occurs when ML models are trained on small datasets. However, when C1-C3 natural gas adsorption data were used to verify the knowledge transfer capability, the R² was only 0.4172, indicating that domain adaptation problems with the source domain reduced the effectiveness of transfer learning. When performing transfer learning, it is therefore important to ensure that the data distributions of the source domain and the new domain are similar and stable.

This study focuses on MOFs and selects the optimal ML model from ten ML models for the adsorption data of multi-component gases. Based on this model, an interpretable ML model is constructed, and ensemble algorithms and transfer learning suited to MOF systems are developed, revealing the importance and dependence of the various descriptors inside the ML model during high-throughput screening of MOFs. The optimal descriptor intervals that promote adsorption are identified, which not only improves the accuracy of ML-based high-throughput screening but also enhances the ability to predict the adsorption performance of unknown MOFs.
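
The evaluation metrics quoted above (R², MSE, MAE, and RMSE) can be illustrated with a minimal sketch. It assumes the standard textbook definitions of these metrics; the source does not state which implementation was used. The function name `regression_metrics` and all numerical values are hypothetical and serve only as an illustration, not as data from this study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MSE, MAE and RMSE using their standard definitions.

    Hypothetical helper for illustration; not the code used in the study.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(r * r for r in residuals) / n    # mean squared error
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    rmse = math.sqrt(mse)                      # root mean squared error

    # R2 = 1 - SS_res / SS_tot, undefined when the targets are constant.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(r * r for r in residuals)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"R2": r2, "MSE": mse, "MAE": mae, "RMSE": rmse}


# Made-up adsorption amounts (arbitrary units), NOT results from the study.
y_true = [0.10, 0.35, 0.50, 0.80, 0.95]
y_direct = [0.12, 0.33, 0.52, 0.78, 0.93]    # hypothetical direct-model output
y_transfer = [0.15, 0.30, 0.55, 0.75, 0.90]  # hypothetical transfer-model output

direct = regression_metrics(y_true, y_direct)
transfer = regression_metrics(y_true, y_transfer)

# The gap in R2 between the two models is the kind of quantity compared above.
r2_gap = abs(direct["R2"] - transfer["R2"])
print(direct)
print(transfer)
print("R2 gap:", r2_gap)
```

A small R² gap between the direct and transferred models, together with low error metrics, is the kind of evidence cited above for reliable transfer; a large gap would instead point to a mismatch between the source and target domains.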