
Research On Key Technologies Of Large Scale Biomedical Knowledge-Driven Intelligent Drug Screening

Posted on: 2022-04-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Yang
GTID: 1524307169477204
Subject: Computer Science and Technology
Abstract/Summary:
The research and development of new drugs is a time-consuming and costly process with a high failure rate. It is therefore of great significance to repurpose existing drugs as therapeutics for complex and infectious diseases, and an in-depth understanding of disease mechanisms and drug properties is a prerequisite for doing so. Obtaining domain knowledge about complex chemical molecular structures, biological mechanisms, biomedical entity association networks, disease diagnosis, and treatment pathways through biomedical big data analysis is an important support for new drug research and development, drug repurposing, and clinical drug decision-making. However, several problems must be addressed before intelligent drug screening can be applied, including the transformation of unstructured information into structured knowledge, the integration of curated databases, and the exploration of the complex relations between biomedical molecules. This thesis therefore focuses on the needs of intelligent drug screening and studies key technologies for biomedical knowledge extraction, organization, representation, and application: we use text mining and deep learning techniques to mine potential biomedical knowledge from large-scale biomedical literature and existing databases, represent it as a biomedical knowledge graph, and then build an intelligent model that provides drug screening predictions. The main research content of this thesis includes the following aspects:

(1) Domain knowledge-empowered pre-trained model for biomedical text mining. To solve the problem of named entity recognition and relation extraction from a large amount of biomedical literature, we propose BioERNIE, a deep learning-based enhanced representation model that effectively uses the knowledge in the biomedical corpus (PubMed abstracts and full-text PMC articles). By adding word features and distance features to construct the embedded representations of target entity words and non-target words, the model's semantic understanding is effectively improved. Our models achieve outstanding results on various text mining tasks (NER, RE) on benchmark datasets. The main improvements are in biomedical named entity recognition for genes/proteins (JNLPBA: F1-score increased by 4.23%) and species (LINNAEUS: F1-score increased by 0.97%), and in relation extraction (DDI: F1-score increased by 1.55%).

(2) Text mining-based knowledge graph construction. To solve the problem of organizing biomedical knowledge, we use the aforementioned literature mining technology to design a framework for constructing a biomedical knowledge graph, integrating biomedical knowledge from molecular biology with Traditional Chinese Medicine knowledge based on Chinese herbs. We employ biomedical named entity recognition approaches to tag genes, diseases, drugs, symptoms, Chinese herbs, and other entities in a large set of domain-specific literature; these entities form the graph's nodes. At the same time, a rule-based approach and the pre-trained model BioERNIE are integrated to extract and classify the relations between entities, which are represented as edges linking the nodes in the knowledge graph. In particular, we construct a knowledge graph (StrokeKG) that includes multiple entity types and multiple entity relations for stroke, a disease with complex causes and mechanisms. With 46,983 nodes of 9 types and 157,302 relationships of 30 types, StrokeKG supports the rapid retrieval of biomedical knowledge related to stroke and can provide new ideas for the development of new and targeted drugs and for clinical treatment.

(3) Matrix factorization acceleration for drug recommendation. To accelerate matrix factorization in drug recommendation, we choose the coordinate descent method and propose an efficient and portable CDMF solver for recommendation systems. On the one hand, we implement a diagnostic benchmark and observe that existing matrix factorization techniques do not account for the mismatch between the hierarchical thread organization of modern hardware and the rows (or columns) of the biomedical interaction matrix. We apply thread batching and CSR5-based load balancing to achieve high performance. On the other hand, the CDMF solver is implemented in OpenCL to increase the portability of the code. Based on the architectural specification, we customize code variants for each platform to map the solver efficiently to the underlying hardware. In experiments on gene-based chemical recommendation, disease-based chemical recommendation, and disease-based phenotype recommendation, our implementation runs 2× faster on Intel Xeon CPUs and 18× faster on an NVIDIA Tesla V100 GPU than the baseline implementations.

(4) A large-scale and heterogeneous GCN model for CGI prediction. Extracting chemical-gene interactions (CGIs) is crucial for screening drugs. To solve the problem of CGI prediction in drug screening, we developed BioNet, a deep biological network model with a graph encoder-decoder architecture, based on a large-scale biomedical heterogeneous network. The graph encoder uses graph convolution to learn latent information embedded in complex interactions among chemicals, genes, diseases, and biological pathways. We construct a large-scale multi-source biomedical entity interaction network, also named BioNet, that includes 7,249 diseases, 55,440 genes, 14,273 chemicals, and 2,363 pathways as nodes, and 34,005,501 relations as edges. In general, such a massive deep graph model is difficult to train. BioNet addresses this problem with a parallel training algorithm that uses multiple GPUs. The evaluation experiments indicate that BioNet exhibits outstanding prediction performance, with a best AUROC of 0.952, a best AUPRC of 0.944, and a best AP@20 of 0.922, significantly surpassing state-of-the-art methods. For further validation, the top CGIs predicted by BioNet for cancer and stroke were verified against external curated data and published literature.

(5) Applications of intelligent drug screening methods. To discover potential drugs for Coronavirus disease 2019 (COVID-19), we combine our proposed methods in an intelligent drug screening study for COVID-19. Using the text mining technology implemented in this thesis, we extract structured knowledge about COVID-19 from unstructured texts. Using the knowledge graph construction method proposed in this thesis, we integrate the text mining results with existing curated data to construct a COVID-19 knowledge graph for the analysis of related biological mechanisms. Finally, BioNet is used to conduct intelligent screening of potential drugs against COVID-19. By comparing the results with drugs screened by molecular dynamics and other methods (such as dipyridamole, which has significant clinical effects), the effectiveness of the method system constructed in this thesis was verified.
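
As a rough illustration of the coordinate descent idea named in aspect (3), the following pure-Python sketch factorizes a small sparse interaction matrix by updating one latent coordinate at a time. It is not the thesis's CDMF solver, which is implemented in OpenCL with thread batching and CSR5-based load balancing; the function name `cd_matrix_factorization`, the squared-error objective with L2 regularization, the hyperparameters, and the toy data are all assumptions made for this example.

```python
import random


def cd_matrix_factorization(entries, n_rows, n_cols, rank=2, lam=0.1,
                            sweeps=20, seed=0):
    """Coordinate descent for R ~ W H^T on observed entries only.

    entries: dict {(i, j): value} of observed interactions (hypothetical input
    format). Minimizes sum of squared residuals + lam * (||W||^2 + ||H||^2).
    """
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]

    # Residual r_ij - w_i . h_j for every observed entry, kept up to date.
    res = {(i, j): v - sum(W[i][t] * H[j][t] for t in range(rank))
           for (i, j), v in entries.items()}

    # Observed column indices per row and row indices per column.
    by_row = [[] for _ in range(n_rows)]
    by_col = [[] for _ in range(n_cols)]
    for (i, j) in entries:
        by_row[i].append(j)
        by_col[j].append(i)

    def objective():
        reg = sum(x * x for row in W for x in row)
        reg += sum(x * x for row in H for x in row)
        return sum(e * e for e in res.values()) + lam * reg

    history = [objective()]
    for _ in range(sweeps):
        for t in range(rank):
            # Closed-form minimizer for each single coordinate W[i][t].
            for i in range(n_rows):
                num = sum((res[i, j] + W[i][t] * H[j][t]) * H[j][t]
                          for j in by_row[i])
                den = lam + sum(H[j][t] ** 2 for j in by_row[i])
                new = num / den
                delta = new - W[i][t]
                for j in by_row[i]:
                    res[i, j] -= delta * H[j][t]
                W[i][t] = new
            # Same update for each single coordinate H[j][t].
            for j in range(n_cols):
                num = sum((res[i, j] + W[i][t] * H[j][t]) * W[i][t]
                          for i in by_col[j])
                den = lam + sum(W[i][t] ** 2 for i in by_col[j])
                new = num / den
                delta = new - H[j][t]
                for i in by_col[j]:
                    res[i, j] -= delta * W[i][t]
                H[j][t] = new
        history.append(objective())
    return W, H, history


# Toy 4 x 4 interaction matrix with eight observed entries (made-up data).
toy = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0,
       (2, 1): 1.0, (2, 2): 1.0, (3, 0): 1.0, (3, 3): 1.0}
W, H, history = cd_matrix_factorization(toy, n_rows=4, n_cols=4)
```

Because each update is the exact minimizer of the objective along one coordinate, the values in `history` should not increase from sweep to sweep. A solver like the one described in the abstract would parallelize the inner loops over rows and columns, which is where the load-balancing concerns come from.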
Keywords/Search Tags: Biomedicine, Biomedical relation extraction, Biomedical relationship prediction, Knowledge graph, Recommendation system, Coordinate descent method, Graph convolutional neural network, Parallel computing, Stroke, Novel coronavirus pneumonia (COVID-19)