Research On Gene-Mutation-Disease Relation Extraction Technology For Precision Medicine Knowledge Base

Posted on: 2020-02-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Tong
GTID: 1364330599452435
Subject: Biomedical engineering
Abstract/Summary:
With the explosive growth of data and the rapid development of technology in biomedicine in the age of precision medicine, constructing knowledge bases entirely by hand has become impractical. Information extraction and knowledge mining from the massive literature have therefore become a research focus and application hotspot in recent years, and sustained efforts from academia and industry have brought important progress in text mining tasks including named entity recognition, term extraction, relation extraction, event extraction and coreference resolution. However, for the specific task of "gene-mutation-disease" relation extraction for precision medicine knowledge base construction, current approaches, models and algorithms still fall short of practical needs in five main respects:

(1) Named entity recognition algorithms depend heavily on feature engineering: feature selection, representation and pre-processing are time-consuming and labor-intensive, and critical lexical and syntactic features are buried in a rich feature set containing part-of-speech, dependency and context features.

(2) Relation type definitions focus only on biomedical background: classification by strength or probability cannot express hyponymy between relations, and there is neither a top-level relation type scheme to guide relation mapping nor a bottom-level vocabulary of relation trigger words to assist relation locating.

(3) Standard corpora and efficient corpus construction tools for "gene-mutation-disease" research are lacking, so corpora are built manually by experts, who must rely on their own understanding and knowledge to judge both whether a relation exists and what type it has, between entities in different locations and with different mentions. Timeliness and subjectivity therefore strongly influence the scale and quality of the corpus.

(4) Relation extraction algorithms concentrate mainly on simple relations. Binary, intra-sentence, two-category classification can be extended to n-ary, inter-sentence, multi-category classification by applying entity extension, coreference resolution and hierarchical classification, but doing so introduces cascading errors and extra noise that affect performance and efficiency.

(5) Knowledge graph construction and presentation rely on a single source, which makes it hard to guarantee the richness of knowledge expression and the persuasiveness of knowledge evidence, to support visualization of the relation network and traceability of relation sources, and to meet end users' needs for guidance in understanding and assistance in research.

The goal of this thesis is to bring together natural language processing technology, biomedical literature and knowledge bases in order to optimize gene, mutation and disease entity recognition algorithms; to construct relation types and a relation extraction corpus for the "gene-mutation-disease" text mining task; to design a "gene-mutation-disease" relation extraction algorithm; and to develop a "gene-mutation-disease" knowledge graph construction, curation and visualization platform with knowledge from diverse sources. The key research findings are summarized below:

(1) We propose and implement an integrated model that combines a deep neural network with traditional recognition methods for disease named entity recognition. Based on a bi-directional Long Short-Term Memory model, the algorithm consists of natural language pre-processing, character/word embedding representation, deep neural network prediction, Viterbi algorithm optimization and dictionary mapping correction. It effectively addresses the heavy dependence of named entity recognition on feature engineering. Tested on the NCBI Disease corpus, it achieves precision, recall and F-score of 89.16%, 90.00% and 89.58% respectively, higher than known state-of-the-art models.

(2) We propose and implement an integrated method and processing procedure that combines unsupervised clustering with ontology guidance for "gene-mutation-disease" relation type definition. Based on the UMLS Semantic Network, the method includes data screening and acquisition, natural language pre-processing, biomedical named entity recognition, open relation extraction, manual curation, clustering analysis and top-level ontology mapping. It takes full advantage of the extensive coverage of open relation types, the high-level induction of semantic hierarchical clustering and the strict constraints of a top-level ontology, and effectively addresses the problem that relation type definitions focus only on biomedical background and cannot be applied in biomedical text mining. We establish a 5-layer, 16-category relation type hierarchy and a vocabulary of 58 commonly used relation trigger words, with coverage of 94.12% and 95.08% respectively, which is sufficient to express the main "gene-mutation-disease" semantic relations.

(3) We propose and implement an integrated method that combines distant supervision and expert curation for "gene-mutation-disease" corpus construction, develop a relation extraction corpus curation platform, and construct a relation extraction corpus. Based on the ClinVar knowledge base, the method combines full-text literature acquisition, natural language pre-processing, biomedical named entity recognition, open relation extraction, relation type mapping and expert curation. It takes full advantage of the authoritative knowledge guidance provided by the knowledge base used for distant supervision, and effectively addresses the problem that relation extraction corpus construction relies mainly on expert curation. The corpus covers 527 full-text biomedical articles, 3,366 entity mentions and 963 relation instances; mappable, unmappable and non-existent relations account for 61.83%, 4.84% and 33.33% respectively, and the final average inter-annotator agreement reaches 85.14%, which meets the initial quantity and quality requirements of corpus construction.

(4) We propose and implement an integrated algorithm that combines domain prior information and a pre-trained language model for "gene-mutation-disease" relation extraction. Based on the Google BERT model, the algorithm consists of domain vocabulary extension, unsupervised pre-training, supervised fine-tuning, dependency parsing analysis and knowledge base distant supervision. It takes full advantage of the dynamic features captured by the neural network and the accurate positional cues provided by prior knowledge, and effectively addresses the problem that relation extraction algorithms concentrate mainly on simple relations and cannot handle multi-category classification. Tested on the relation extraction corpus, it achieves precision, recall and F-score of 71.46%, 73.21% and 72.32% on the four-category classification "irrelevancy/association/predisposition/consequence", outperforming the general-domain BERT model and common relation extraction models.

(5) We propose and implement an integrated method that combines entity linking and relation mapping for "gene-mutation-disease" knowledge base construction, and develop a knowledge graph construction and visualization platform. Based on the Gene, MedGen, ClinVar and PubMed Central knowledge bases, the method implements data import, data integration, data storage, data presentation, knowledge retrieval and knowledge visualization. It takes full advantage of the deep integration of knowledge, corpus and literature and of the comprehensive presentation of text, tables and graphs, and effectively addresses the problem that knowledge graph construction and presentation rely on a single source and so fail to represent knowledge and provide evidence. The platform now covers 1,463 gene names, 21,711 mutation names, 2,201 disease names, 590 relation instances and 527 full-text articles. In simulated genetic screening and differential diagnosis scenarios, the platform shows potential to guide scientific research and assist clinical decision making.
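
The precision, recall and F-score figures reported in the abstract can be cross-checked with a short calculation. The sketch below assumes the reported F-score is the balanced F1, that is, the harmonic mean of the reported precision and recall; the abstract does not state the averaging scheme used for the four-category task, so this is a consistency check under that assumption, not a reproduction of the author's evaluation. The function name `f_score` and the `reported` table are illustrative only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score from precision and recall given in the same units (here: percent).

    With beta = 1 this is the harmonic mean of precision and recall (F1).
    """
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall, reported F-score), copied from the abstract.
reported = {
    "disease NER, NCBI Disease corpus": (89.16, 90.00, 89.58),
    "relation extraction, four-category": (71.46, 73.21, 72.32),
}

for task, (p, r, f_reported) in reported.items():
    f1 = f_score(p, r)
    print(f"{task}: computed F1 = {f1:.2f}, reported = {f_reported:.2f}")
```

Under this assumption both computed values round to the reported ones (89.58 and 72.32), so the three numbers quoted for each task are mutually consistent.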
Keywords/Search Tags:Precision Medicine, Text Mining, Named Entity Recognition, Relation Extraction, Knowledge Base Construction