
Construction Of MeSH-based Biomedical Knowledge Graph And Its Application For Omics Data Analysis

Posted on: 2020-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: M Q He
GTID: 2370330599452356
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
With the development of high-throughput technologies such as next-generation sequencing and mass spectrometry, a large number of omics datasets (genome, transcriptome, and proteome) have been generated, pushing biomedical research into the era of big data. These datasets help to reveal the underlying physiological and pathological mechanisms, as well as the fundamental principles of biological systems. However, analyzing them also poses significant challenges. High-throughput gene set annotation is often the first step in omics data analysis; it helps to understand the overall function of genes, the relationships between genes and diseases, and the regulatory mechanisms of gene expression. For gene annotation, a series of knowledge bases such as GOA, KEGG, Reactome, and OMIM have been established through informatics methods and manual curation. Constructing these databases is often labor-intensive and time-consuming, so they are updated infrequently. In addition, these repositories are domain-specific and cover only a limited number of domains. For example, many fields (such as behavior and behavior mechanisms) still have no knowledge base, and the relevant gene information remains scattered across thousands of biomedical documents. Gene annotation tools such as DAVID and Metascape depend on these databases, so the databases' limitations greatly restrict the usefulness of these tools. On the other hand, with the explosive growth of the biomedical literature, it has become difficult to acquire information from the large volume of newly published literature by manual reading alone. Knowledge graphs provide a new approach to extracting knowledge from unstructured documents. With advances in entity recognition and relation extraction, a variety of biomedical ontologies such as MeSH and automated text-mining tools such as PubTator have emerged, laying a solid foundation for constructing a biomedical knowledge graph for gene annotation.
This paper aims to address the challenges of gene annotation for large-scale omics datasets. First, a MeSH-based method was developed to construct a literature-derived biomedical knowledge graph covering multiple biomedical fields. MeSH-PMID and Gene-PMID correlations were obtained from PubMed and PubTator, respectively, and then integrated to extract Gene-MeSH correlations. Co-occurrence frequency analysis, the chi-square test, and NPMI (Normalized Pointwise Mutual Information) were applied to retain statistically significant Gene-MeSH correlations. Orthologs from the InParanoid database were used to transfer annotations between species. As a result, a biomedical knowledge graph was established, covering 11 species, 16,629 MeSH entities in 16 biomedical fields, 80,756 genes, and 2,676,776 Gene-MeSH correlations. Analysis of the cell type graph showed that cell types such as “Leukocytes”, “Lymphocytes”, “Macrophages”, and “Erythrocytes” are correlated with particularly large numbers of MeSH terms. In addition, 905 highly reliable human immunosuppression genes, together with their related diseases, drugs, and SNPs, were collected as HisgAtlas, the first human immunosuppression gene database.
Then, an online gene set annotation and enrichment analysis tool (MORE) was developed on top of the biomedical knowledge graph. For a submitted gene list, MORE uses the hypergeometric distribution to find significantly enriched MeSH terms. MORE provides three views for data visualization: “Table view”, “Tree view”, and “DAG view”. It also provides a supporting-evidence page for genes and MeSH entities. Currently, MORE supports 16 types of MeSH entities for 11 species. MORE also updates its underlying database automatically by periodically downloading data via APIs, checking the data against the previous version, processing them, and writing the results to the back-end database. Currently, there are two database versions, “June 1, 2018” and “March 1, 2019”.
To evaluate the annotation performance of MORE, it was used to analyze the differentially expressed genes from a rat calorie restriction experiment. Compared with Gene Ontology, MORE detected many more significant biomedical terms, such as the cell types “Neurons”, “Neuroglia”, and “Astrocytes”; the diseases “Liver Neoplasms, Experimental”, “Mammary Neoplasms, Experimental”, and “Diabetes Mellitus”; and chemicals such as “Glucosamine”, “Galactose”, and “Starch”. These results indicate that MORE can provide more biomedical clues for further experimental design. In conclusion, this paper improves the coverage and efficiency of current gene annotation tools for omics data, will promote the integration of omics data and literature, and will ultimately accelerate the discovery of biomedical knowledge.
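The filtering of Gene-MeSH correlations described in the abstract can be illustrated with a minimal sketch. The abstract names the three criteria (co-occurrence frequency, chi-square test, NPMI) but not the thresholds or the implementation, so the function names, the default cutoffs `min_cooccurrence` and `min_npmi`, and the 0.05 significance level below are assumptions made for illustration only.

```python
import math

# Critical value of the chi-square distribution with 1 degree of freedom
# at alpha = 0.05. The significance level itself is an assumption here.
CHI2_CRITICAL_DF1 = 3.841


def npmi(n_xy, n_x, n_y, n_total):
    """Normalized pointwise mutual information of a gene and a MeSH term.

    n_xy: articles mentioning both; n_x: articles mentioning the gene;
    n_y: articles indexed with the term; n_total: all articles.
    Ranges from -1 (never together) through 0 (independent) to 1
    (always together).
    """
    if n_xy == 0:
        return -1.0
    p_xy = n_xy / n_total
    if p_xy == 1.0:
        return 1.0  # both occur in every article; the normalizer would be 0
    pmi = math.log(p_xy / ((n_x / n_total) * (n_y / n_total)))
    return pmi / -math.log(p_xy)


def chi_square(n_xy, n_x, n_y, n_total):
    """Pearson chi-square statistic of the 2x2 co-occurrence table."""
    a = n_xy                        # gene and term
    b = n_x - n_xy                  # gene only
    c = n_y - n_xy                  # term only
    d = n_total - n_x - n_y + n_xy  # neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate margin carries no evidence
    return n_total * (a * d - b * c) ** 2 / denom


def keep_correlation(n_xy, n_x, n_y, n_total,
                     min_cooccurrence=2, min_npmi=0.0):
    """Apply all three filters; the default thresholds are hypothetical."""
    return (n_xy >= min_cooccurrence
            and chi_square(n_xy, n_x, n_y, n_total) > CHI2_CRITICAL_DF1
            and npmi(n_xy, n_x, n_y, n_total) > min_npmi)
```

Requiring all three criteria at once is one plausible reading of the abstract. In this sketch the frequency floor guards against pairs seen only once, the chi-square test checks departure from independence, and a positive NPMI ensures the departure is an attraction and not an avoidance.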
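The enrichment step, in which a submitted gene list is tested against MeSH terms with the hypergeometric distribution, can be sketched as follows. This is not the MORE implementation: the function names, the input layout (a dictionary from MeSH term to annotated gene set), and the choice of a one-sided upper-tail test are assumptions, and the abstract does not say whether a multiple-testing correction is applied, so none is shown.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background; K: background genes annotated with the term;
    n: genes in the submitted list; k: submitted genes annotated with the term.
    """
    total = comb(N, n)
    upper = min(K, n)
    # comb returns 0 when n - i exceeds N - K, so impossible tables drop out.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def enrich(gene_list, term_to_genes, background):
    """Rank MeSH terms by enrichment p-value for a submitted gene list.

    term_to_genes is a hypothetical mapping {mesh_term: set_of_genes}.
    Returns (term, overlap, p_value) tuples, most significant first.
    """
    background = set(background)
    submitted = set(gene_list) & background
    results = []
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        overlap = len(submitted & annotated)
        if overlap == 0:
            continue  # a term with no hits cannot be enriched
        p = enrichment_p_value(overlap, len(submitted),
                               len(annotated), len(background))
        results.append((term, overlap, p))
    return sorted(results, key=lambda row: row[2])
```

Calling `enrich(["g1", "g2"], {"Neurons": {"g1", "g2"}}, background)` with a background of gene identifiers returns the terms ordered by p-value. Restricting both the submitted list and the annotations to the background keeps the four counts consistent with one another.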
Keywords/Search Tags:Bioinformatics, Knowledge graph, MeSH, Enrichment analysis