Photocatalytic materials are semiconductor catalysts that drive photochemical reactions under illumination. They are green and safe, operate under mild reaction conditions, and have important applications in new energy development and environmental pollution control. Materials have traditionally been developed by trial and error, which relies on large numbers of repeated experiments; development is therefore slow and struggles to meet the growing industrial demand for photocatalytic materials. With advances in computer technology and materials informatics, a variety of materials simulation packages have appeared that can model the internal structure and reaction processes of materials. Although simulation is more convenient and faster than trial and error, it still requires human intervention to construct calculation inputs, monitor calculation status, and process calculation outputs, and it lacks an effective way to manage and organize large amounts of heterogeneous data. As a result, valuable information is hard to extract, and the long development cycle of new materials is not fundamentally shortened.

Inspired by the Materials Genome Initiative (MGI), this study explores these issues in depth. Computed data serve as the thread running through the whole project, linking high-throughput materials calculations, databases, and machine learning into a complete chain of data generation, data storage, and data mining that greatly reduces the cost and improves the efficiency of new materials development.

For data generation and storage, this study uses the AiiDA high-throughput materials computing framework to integrate the PWscf calculation program and key components such as the PostgreSQL database into a single system. Workflows and databases are designed in turn for the relaxation calculations of single atoms and elemental substances, for the energy calculations of single-metal-atom-doped structures based on the Al2O3 system, and for the combined relaxation and electronic self-consistent calculations of the inverse spinel (anti-spinel) system, building the data set from scratch. In addition, 10 modules are designed for specific tasks and requirements; they greatly increase the automation of the calculations and save considerable time and labor. More than a thousand original calculation records on photocatalytic materials are obtained, laying the foundation for data mining.

For data mining, this study uses the random forest algorithm. After the characteristics of random forests are analyzed and compared, the CART regression tree is selected as the base learner for a random forest regression model. The band gap data obtained from the high-throughput calculations are taken as the original data set and preprocessed for model training. During training, attention is focused on key hyperparameters such as the number of trees and the maximum tree depth, and cross-validation is used to examine the effect of different parameter values on model accuracy. Finally, the model is evaluated by its performance on the test set.
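The data mining step described here (a random forest of CART regression trees, tuned over the number of trees and the maximum depth by cross-validation, then scored on a held-out test set) can be illustrated with a minimal, dependency-free sketch. It is not the study's implementation: the function names, the two descriptors, the synthetic target values, and the small parameter grid are all assumptions made for illustration, and in practice a library implementation would normally be used.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (CART regression impurity)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, max_depth, rng, max_features=None, depth=0):
    """Grow one CART regression tree; leaves store the mean target value."""
    if depth >= max_depth or len(set(y)) <= 1:
        return mean(y)
    n_features = len(X[0])
    k = max_features or n_features
    best = None
    for j in rng.sample(range(n_features), k):        # random feature subset per split
        for t in sorted({row[j] for row in X})[:-1]:  # thresholds keep both sides non-empty
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:                                  # no valid split exists
        return mean(y)
    _, j, t, left, right = best

    def grow(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          max_depth, rng, max_features, depth + 1)

    return (j, t, grow(left), grow(right))


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node


def fit_forest(X, y, n_trees, max_depth, max_features=None, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]      # sample with replacement
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 max_depth, rng, max_features))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return mean(predict_tree(tree, x) for tree in forest)


def mse(forest, X, y):
    """Mean squared error of the forest on (X, y)."""
    return mean((predict_forest(forest, xi) - yi) ** 2 for xi, yi in zip(X, y))


def cross_val_mse(X, y, n_trees, max_depth, k=3):
    """Mean validation MSE over k folds (fold f holds out samples with i % k == f)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        valid = [i for i in range(len(X)) if i % k == fold]
        forest = fit_forest([X[i] for i in train], [y[i] for i in train],
                            n_trees, max_depth)
        scores.append(mse(forest, [X[i] for i in valid], [y[i] for i in valid]))
    return mean(scores)


if __name__ == "__main__":
    # Synthetic stand-in data (NOT the study's band gaps): two invented descriptors.
    rng = random.Random(42)
    X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(60)]
    y = [3.0 * a + (1.5 if b > 0.5 else 0.0) + rng.gauss(0, 0.1) for a, b in X]
    X_train, y_train, X_test, y_test = X[:45], y[:45], X[45:], y[45:]

    # Hypothetical grid over the two hyperparameters named in the text.
    grid = [(n, d) for n in (5, 20) for d in (2, 4)]
    best_n, best_d = min(grid, key=lambda p: cross_val_mse(X_train, y_train, *p))
    final = fit_forest(X_train, y_train, best_n, best_d)
    print("selected n_trees, max_depth:", best_n, best_d)
    print("test MSE:", round(mse(final, X_test, y_test), 4))
```

Keeping the test split out of the cross-validation loop means the final error estimate is not influenced by the hyperparameter search, which matches the order of steps described above.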