Research Of Knowledge Extraction Method For Biomedical Literature Based On Graph Neural Network And Its Application

Posted on: 2024-05-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X W Zheng
Full Text: PDF
GTID: 1520307094976379
Subject: Biomedical engineering
Abstract/Summary:
The rapid development of modern biomedicine has led to a continuous stream of publications. Researchers face the problem of keeping track of this large body of scientific literature and summarizing the knowledge it contains. Automated methods, especially knowledge extraction methods, are therefore required to keep researchers up to date on published biomedical literature and to extract the relevant knowledge accurately. In addition, the extracted knowledge needs to be presented in a human-readable form to provide a high-quality knowledge service that improves the efficiency of scientific research. A knowledge graph supports functions such as knowledge reasoning, intelligent retrieval, question answering, and knowledge cards. Moreover, the reasoning process of a knowledge graph is explicit, which makes it interpretable. It is therefore an excellent approach to organizing biomedical knowledge. Recently, deep-learning-based knowledge extraction methods have been widely adopted and have significantly improved extraction efficiency. However, existing methods underutilize prior knowledge, such as the syntactic and topological features of language, which makes it difficult for them to fully support high-quality knowledge extraction from the literature.

Amid globalization, biosafety-related issues are becoming increasingly complicated. At the same time, international biothreats are diversifying, and nontraditional biothreats, including bioterrorism and laboratory leaks, persist. In particular, the outbreak and spread of coronavirus disease 2019 (COVID-19) has taken a heavy toll on human lives and the economy. Meanwhile, extensive research has been conducted in various fields of biosafety, and multi-source heterogeneous data in the field, including clusters of scientific literature and knowledge bases, are continuously emerging. To provide a better knowledge service for researchers and improve research efficiency, it is necessary to organize
this knowledge well. However, compared with other countries, knowledge graph research in the biosafety field in China focuses on disease diagnosis and treatment, and high-quality knowledge graph systems for technology and equipment in the biosafety field are lacking. It is therefore difficult for researchers to find the latest developments and trends quickly and accurately, which impedes research to some extent.

To address these problems, we integrated prior knowledge of language, including syntactic and topological features, into knowledge extraction methods, and designed novel methods for biomedical named entity recognition and chemical-protein interaction relation extraction based on graph neural networks. In addition, guided by biomedical prior knowledge, we designed a knowledge graph schema for technology and equipment in the biosafety field. Moreover, we developed a knowledge graph and an application system for technology and equipment in the biosafety field. We then took COVID-19 as an application case to verify the practicability of the knowledge graph system, which also put the knowledge extraction methods proposed in this thesis into practice and made it easier for researchers in related fields to access high-quality knowledge services. The main results of this thesis are as follows:

1. Most existing biomedical named entity recognition (BioNER) methods are based on the sequence annotation framework, which underutilizes prior knowledge of language, including syntactic and topological features. To solve this problem, we proposed a novel biomedical NER method based on the graph attention network (GAT). The method first used a graph to model the topology of a sentence and then formalized the biomedical NER task as a node classification problem, thereby introducing the topological and syntactic features of the language. Specifically, the method obtained contextual and syntactic information via a
pre-trained language model, BioBERT, and a natural language processing tool, spaCy, respectively. It then used a multi-head attention mechanism to generate a distributed representation that integrated semantic and grammatical information. The model was evaluated on 8 biomedical NER corpora, namely BC2GM, JNLPBA, BC4CHEMD, BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, Species-800, and LINNAEUS, and achieved F1 scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, and 90.99%, respectively. Compared with BioBERT, this method achieved significantly higher recall on the above 8 corpora, with a maximum increase of 3.47%. In addition, this method raised precision on 7 of the datasets, by as much as 1.10% on average, the exception being Species-800 (-1.27%). Moreover, the proposed method increased the F1-score on all 8 datasets compared with BERT, with increases ranging from 0.43% (BC2GM) to 2.75% (LINNAEUS). Experiments on the effect of hyper-parameters on model performance and an ablation study showed that the multi-head attention mechanism and the syntactic and topological features do help to improve performance. These experimental results verified the effectiveness of our model and demonstrated that using a node classification framework and integrating the topological and syntactic features of language does improve model performance, which is expected to provide a reference for biomedical NER tasks.

2. Considering that existing biomedical relation extraction (BioRE) methods underutilize language topology, and that sequence-oriented deep neural networks have limitations in processing graph-structured data, we proposed a novel BioRE method based on the graph pointer neural network (GPNN). The method used a graph to model a sentence. It used a natural language processing tool, scispaCy, to parse the sentence, which supported the construction of the sentence graph. Meanwhile, it used a pre-trained language model, BioELECTRA, to generate a distributed representation of the semantic
features of each token, which was modeled as an attribute of the corresponding node in the sentence graph. Since some multi-hop neighbors in the graph are important for the RE task, our model used a GPNN layer to select the most important nodes for each token and generate an optimized distributed representation with topological features. A fully connected neural network was also used to generate a distributed representation of the entire sentence. Our method was evaluated on a multi-type RE corpus, CHEMPROT, and achieved precision, recall, and F1-score of 77.97%, 82.07%, and 79.97%, respectively. Compared with existing baseline methods, the gains of our model ranged from 0.66% to 3.47% in precision, from 1.76% to 11.47% in recall, and from 1.19% to 7.47% in F1-score. Our method was also evaluated on the binary RE corpora GAD and EU-ADR, and achieved F1-scores of 83.31% and 83.51%, respectively, which were higher than those of the baseline methods. These results showed that the proposed method performed well on different RE tasks, which indicated its ability to generalize. Regarding the effect of different language models on performance, experimental results showed that, compared with language models based on the BERT architecture, generative adversarial pre-trained models were more suitable for relation extraction tasks. Regarding the stability of the proposed method, we performed 10-fold cross-validation on CHEMPROT, GAD, and EU-ADR. The results demonstrated the robustness of our model and indicated that it was stable across different RE tasks. These experimental results verified the feasibility of the proposed model and demonstrated that language topology and GPNN layers can improve model performance, which provides a perspective for relation extraction.

3. To address the lack of high-quality knowledge graphs for technology and equipment in the biosafety field, and to automatically extract biomedical knowledge in
heterogeneous data from multiple sources and provide high-quality knowledge services of biomedical significance, we began by designing a knowledge schema for technology and equipment in the biosafety field, based on biomedical prior knowledge and following ontology development guidelines such as the “seven-step method”, and then implemented a knowledge graph and an application system for technology and equipment in the biosafety field. The system used a task scheduling mechanism and a spider framework as its multi-source heterogeneous data acquisition module, and extracted knowledge automatically by integrating and improving various knowledge extraction methods. It used relational, document, and graph databases to store different types of knowledge. The main application modules were knowledge retrieval and knowledge cards. Finally, we took COVID-19 as an application case to verify the practicability of the knowledge graph system. The system integrated knowledge from multi-source heterogeneous data, including biomedical literature clusters, knowledge bases, laws, and guidelines, and obtained more than 619 technologies, 2,698 pieces of equipment, 67,000 articles, 14,000 pieces of news, 29,000 terms, 800,000 biosafety entities, 560,000 relationships, and 1.9 million attributes. The practicability of the system was demonstrated through functional tests, including knowledge retrieval and knowledge cards in various retrieval scenarios. The system provided high-quality and practical knowledge services for researchers in the field of biosafety.

In conclusion, first, this thesis explored novel methods for BioNER and BioRE based on graph neural networks and linguistic knowledge. Second, a schema for technology and equipment in the biosafety field was designed based on biomedical prior knowledge, and at the same time a knowledge graph and an application system for technology and equipment in the biosafety field were implemented, with COVID-19 as the
application case. This thesis thus not only provides novel tools for automatic and accurate biomedical knowledge extraction, but also fills the gap in high-quality knowledge services for technology and equipment in the biosafety field.
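
The node-classification framing described in result 1 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the toy sentence, the hand-written dependency heads, and the two-dimensional features are all hypothetical, and the dot-product attention used here is a simplified stand-in for the learned attention of a graph attention network. In the thesis, the parse comes from spaCy and the token features from BioBERT.

```python
from math import exp


def build_sentence_graph(tokens, heads):
    """Model a sentence as a graph: one node per token, with edges between
    adjacent tokens and between each token and its dependency head.
    heads[i] is the index of token i's head; the root points to itself."""
    if len(tokens) != len(heads):
        raise ValueError("tokens and heads must have the same length")
    adj = {i: {i} for i in range(len(tokens))}  # self-loops
    for i in range(len(tokens) - 1):  # sequential edges
        adj[i].add(i + 1)
        adj[i + 1].add(i)
    for child, head in enumerate(heads):  # dependency edges
        if child != head:
            adj[child].add(head)
            adj[head].add(child)
    return adj


def attention_aggregate(features, adj):
    """One simplified attention pass: each node becomes a softmax-weighted
    average of its neighbours' features, scored by dot product.
    (Assumption: a real GAT layer uses learned projections instead.)"""
    updated = []
    for i, feat in enumerate(features):
        nbrs = sorted(adj[i])
        scores = [sum(a * b for a, b in zip(feat, features[j])) for j in nbrs]
        top = max(scores)
        weights = [exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        updated.append([
            sum(w / total * features[j][d] for w, j in zip(weights, nbrs))
            for d in range(len(feat))
        ])
    return updated


# Toy example (hypothetical parse): "inhibits" is the root of the sentence.
tokens = ["Aspirin", "strongly", "inhibits", "COX-1"]
heads = [2, 2, 2, 2]
graph = build_sentence_graph(tokens, heads)

# Hypothetical 2-d token features standing in for language-model embeddings.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
updated = attention_aggregate(features, graph)
```

Each row of `updated` is a per-token representation that mixes in information from syntactic and sequential neighbours. A classifier applied to each row would then assign an entity label to each node, which is what treating NER as node classification amounts to.
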
Keywords/Search Tags: Graph neural network, Named entity recognition, Relation extraction, Knowledge graph, Biosafety