| The knowledge graph (KG) is a technology proposed by Google in 2012. With the continuous launch of related applications, it has attracted extensive attention in industry and academia. In recent years, knowledge graph technology has gradually matured, and many researchers are actively using it to solve domain-specific problems. The construction of knowledge graphs in vertical domains has become an important research topic in the era of artificial intelligence, but many domains still lack specialized knowledge graphs for researchers and developers to use. Through a literature review, we found that most molecular knowledge graphs developed in chemistry use whole molecules as entity nodes, which makes them difficult to use as support for research based on atomic composition, such as new molecular design and molecular generation. Molecular knowledge graphs based on atomic nodes are of great value for de novo molecular design and generation, but labeling atomic entities is a difficult problem in constructing them. The type of an atomic entity is determined by molecular structure, element type, bond type, and many other factors. Even domain experts sometimes find it difficult to identify atom types accurately in many chemical molecules, let alone to define, extract, and classify atomic entities precisely. In addition, the structures of chemical molecules are complex and diverse, and the amount of chemical molecular data is huge, so manual entity extraction requires enormous manpower and is costly. Recently, the rapid development of deep learning algorithms has provided a new option for constructing molecular knowledge graphs. Deep learning is a representation learning method. Using deep learning to obtain the vector
representation of each atomic entity as the node representation of the molecular knowledge graph can effectively solve the entity extraction problem. After the vector representations of atomic entities are classified and the entities are named, the bonding relationships between atomic entities can be established; that is, the entities can be assembled into a knowledge graph. In addition, the deep learning model can be pre-trained in an unsupervised manner, which saves a great deal of time and manpower on data annotation. Constructing the molecular knowledge graph from a pre-trained model therefore makes construction faster and cheaper and makes the graph more convenient and general to apply, providing a new approach to molecular design. Based on these considerations, this thesis conducts the following research. A molecular knowledge graph with atoms as nodes is constructed and applied: vector representations of atoms are first obtained from the pre-trained model, and entity naming is then realized through entity clustering to obtain the atomic node entities used to construct the molecular knowledge graph. The molecular knowledge graph is then applied to molecular design and property prediction to demonstrate its applications and to preliminarily test its quality. Firstly, departing from traditional methods, this thesis extracts atomic entities from the SMILES sequences of unannotated molecular data with a deep learning pre-trained model and constructs a molecular knowledge graph based on atomic entities. Specifically, the SMILES sequences of molecules are preprocessed with RDKit to obtain the chemical properties and structural information within each molecule. These features are input into the pre-trained model (ChemBERTa) to obtain the embedded
representation of atoms. Then, RDKit is used to pre-classify entities according to the species of each atom's neighboring atoms, which determines the structure of the entities and improves the interpretability of the extracted entities. The cosine similarity between the classified atomic vectors is calculated, and a similarity threshold for belonging to the same entity is set for entity fusion. After fusion, the average of the atomic vectors in each class is regarded as one entity of the molecular knowledge graph, representing atoms in a specific environment. At the same time, the open-source package RDKit is used to obtain the chemical bonds between atoms and form the triplets {atomic entity 1 - bond relation - atomic entity 2} that make up the knowledge graph. Finally, the knowledge graph is stored and visualized. Secondly, based on the constructed molecular knowledge graph, this thesis realizes a molecular design application through a link prediction model. The link prediction model InteractE is used, with a WGCN encoder added on top of the base model, which slightly improves the results of the link prediction model on the constructed triplet dataset. The values of Hits@1, Hits@10, and MRR are 0.423, 0.476, and 0.635, respectively. By using the model to predict bonding probability scores for different molecules, a score of 0.5 is identified as the criterion for a reasonable molecular structure; that is, if the average score of all bonding relationships within a molecule is above 0.5, the molecular structure has a high probability of existing as a reasonable structure and a high probability of being generable. Thirdly, this thesis applies molecular knowledge graph embeddings to a molecular property prediction task to test the validity of the atomic representations. Adding the atom embeddings from the molecular knowledge graph to the initial atom features reduces the prediction error by 15% within the framework of the message
passing neural network (MPNN). The experimental results show that the molecular knowledge graph embeddings effectively represent atomic environment information that is positively related to the task, which helps improve performance on downstream tasks. |
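
The entity fusion step and the 0.5 scoring rule described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the greedy merging order, the function names (`fuse_entities`, `is_plausible`), the default threshold of 0.9, and the toy two-dimensional vectors are all assumptions made for illustration. In the thesis the vectors come from ChemBERTa and the bond scores from the link prediction model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fuse_entities(atom_vectors, threshold=0.9):
    """Greedy fusion (an assumed procedure): each atom vector joins the most
    similar existing class if its cosine similarity to that class mean reaches
    `threshold`; otherwise it starts a new class. Returns one mean vector per
    class, i.e. one entity per atomic environment."""
    classes = []  # each class is a list of member vectors
    for vec in atom_vectors:
        best_idx, best_sim = None, threshold
        for idx, members in enumerate(classes):
            sim = cosine_similarity(vec, mean_vector(members))
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            classes.append([vec])
        else:
            classes[best_idx].append(vec)
    return [mean_vector(members) for members in classes]


def is_plausible(bond_scores, cutoff=0.5):
    """True if the average predicted bond score is above the cutoff
    (0.5 in the abstract). An empty score list is treated as not plausible."""
    if not bond_scores:
        return False
    return sum(bond_scores) / len(bond_scores) > cutoff


# Toy data: the first two atoms point in nearly the same direction and fuse.
atoms = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
entities = fuse_entities(atoms, threshold=0.9)
print(len(entities))                    # number of fused entities
print(is_plausible([0.7, 0.6, 0.4]))    # average bond score compared with 0.5
```

A greedy pass is order-dependent, so a different ordering of atoms can yield different classes; a full pairwise clustering would avoid this at higher cost.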