Ontology-based Integration Knowledgebase Construction And Knowledge Discovery Of Lung Cancer Genetic Information

Posted on: 2021-02-03; Degree: Doctor; Type: Dissertation
Country: China; Candidate: Z G Cui
GTID: 1364330611992125; Subject: Health Service Management
Abstract/Summary:
Objective: Lung cancer is one of the most rapidly growing and life-threatening malignancies. According to data released on the WHO website, cancer killed 8.8 million people worldwide in 2015, including 1.69 million deaths from lung cancer, or nearly 20 percent of all cancer deaths. Research on the genetics and variation of lung cancer has long been a focus of experts in related fields. As of March 2019, PubMed held more than 100,000 articles related to lung cancer. This accumulation of lung cancer data and knowledge provides a good foundation for further data mining. To organize and manage the literature data and gene expression data of lung cancer more effectively, and to support the acquisition and use of the related knowledge, this study extracts association patterns between lung cancer biological entities and disease type, genome and clinical feature information from expression data and unstructured text data. The multi-source data are then seamlessly integrated using ontology and semantic network methods. On this basis, an integrated knowledge base of lung cancer genetic information is constructed to achieve efficient management and fine-grained annotation of lung cancer genetic and mutation knowledge. The knowledge base can provide knowledge services and decision support to researchers in the field and supports prediction of lung cancer gene regulation networks and key genes through data-based reasoning. The study is divided into three parts: the mining and integration of genetic information and clinical feature information related to lung cancer; the construction of an ontology knowledge base of lung cancer genetic information; and a demonstration application that uses the ontology knowledge base to construct a lung cancer CASC8 gene interaction network based on semantic technology.

Subjects and Methods: The subjects of this study were mainly literature data and gene expression data related to lung cancer, which were downloaded from
PubMed and TCGA (LUAD and LUSC). The dictionary data came from UMLS, Entrez Gene, FARNA, LNCipedia, miRBase and other sources. Interactions between miRNA and mRNA/lncRNA were taken from the miRWalk and LncBase databases.

During the mining and integration phase, lung cancer-related biomedical entities were identified from unstructured text data and the relationships among them were extracted by text mining. Gene expression data related to lung cancer were analyzed with bioinformatics methods, including differential expression analysis, WGCNA and survival analysis, to identify differentially expressed genes and co-expressed genes and to find associations between overall survival and differentially expressed genes. The text mining process comprised dictionary construction, corpus preparation, part-of-speech (POS) tagging, dependency syntax analysis, named entity recognition and entity relation extraction. The dictionaries covered disease, mRNA, miRNA, lncRNA and clinical feature information. The software tools included Python, Stanford CoreNLP and PubTator. Named entity recognition was based on dictionaries and grammar rules, and its results were integrated with the data from PubTator. Gene expression data were analyzed in R: the expression data were corrected using the DESeq2 package, and differential expression analysis of genes was performed with the edgeR package. WGCNA and survival analysis were then run on the differentially expressed genes. Finally, Cytoscape 3.6.1 was used to draw the ceRNA network.

In the knowledge base construction stage, this study followed the "five guidelines" of ontology construction and built the ontology knowledge base using the "seven-step method". The knowledge representation language is OWL. The knowledge base model was constructed in Protege 5.20, and the related data were managed with MySQL 5.6.26, D2RQ 2.0 and the Apache Jena TDB database software. The integration of data and model, and further reasoning, were carried out with Apache Jena
3.14. Finally, the linked data in the integrated knowledge base were used to construct the gene interaction network of the CASC8 gene and to screen candidate related genes.

Results:
1. Data sources. A total of 107718 articles were downloaded from PubMed. From the TCGA database, 515 RNA-seq and 513 miRNA-seq data of LUSC and 501 RNA-seq and 478 miRNA-seq data of LUAD were downloaded.
2. Text mining results. The corpus included 981396 sentences. Named entity recognition found 989136 entities, and 595694 entities remained after integration with PubTator data and cleaning. Screening within the scope of the sentence yielded 51661 relationship pairs: 30532 between genes and clinical information, 4786 between genes and mutations, 11771 between genes and lung cancer types, 1750 between mutations and lung cancer types, and 2822 between mutations and clinical information. Dependency syntactic graph analysis yielded 49032 "entity-relational verb-entity" triples.
3. Gene expression results. Differential expression analysis of the LUAD project found 2501 differentially expressed mRNAs (1958 upregulated, 543 downregulated), 1503 differentially expressed lncRNAs (1296 upregulated, 207 downregulated) and 118 differentially expressed miRNAs (98 upregulated, 20 downregulated). The LUSC project yielded 3488 differentially expressed mRNAs (2318 upregulated, 1170 downregulated), 1687 differentially expressed lncRNAs (1425 upregulated, 262 downregulated) and 170 differentially expressed miRNAs (143 upregulated, 27 downregulated). Genes associated with overall survival (OS) were screened at a log-rank test significance level of p < 0.05: LUAD had 541 mRNAs, 120 lncRNAs and 13 miRNAs associated with OS, and LUSC had 774 mRNAs, 335 lncRNAs and 19 miRNAs associated with OS. The ceRNA network of LUAD differentially expressed genes contained 39 mRNA, 23 miRNA and 120 lncRNA nodes, with 506 lncRNA-miRNA relationship pairs and 50 miRNA-mRNA relationship pairs; the LUSC network contained 55 mRNA, 28 miRNA and 722 lncRNA nodes, with 4532 lncRNA-miRNA relationship pairs and 68 miRNA-mRNA relationship pairs.
4. Ontology construction. After the database and the mapping file were built, conversion from the relational database to the RDF database produced a total of 2755697 triples. The main concepts were genes, variants, disease types, clinical features and relational verbs. In addition to the mRNA, miRNA and lncRNA subclasses, the gene class was given further subclasses related to gene expression, clinical features and lung cancer type. The lung cancer type class and its hierarchy follow the pathological classification of lung cancer. The subclasses of the clinical feature class are diagnosis, investigation, disease pathological process, treatment and prognosis. The relationships between classes in the knowledge base mainly include relationships between entity classes (genes, variations, disease types and clinical features), relationships between entities and biomedical texts, co-expression relationships of genes, and targeting relationships between miRNAs and mRNAs/lncRNAs. Six object properties are defined in the ontology model: association, target, targeted, co-express, is-in and include. After the ontology model was combined with the triple data, reasoning rules were defined to verify the ontology model and to reclassify the genes. In the application example, the CASC8 gene was retrieved from the knowledge base by SPARQL query. A total of 73 co-expressed lncRNAs, 16 targeting miRNAs, 127 ceRNA network mRNAs and 1 SNP (rs10505477) related to the CASC8 gene were obtained. Among these genes there are 21 diagnostic genes, 13 preventive genes and 76 therapeutic genes. A semantic network of CASC8 associated with lung cancer was built, and a gene interaction network of CASC8 was also built following the ceRNA network principle.

Conclusion:
1. We
successfully identified lung cancer-related biomedical entities and extracted the relationships between them from unstructured lung cancer literature data using text mining.
2. Seamless integration of lung cancer genetics knowledge was achieved through text mining of unstructured literature data and bioinformatics analysis of gene expression data.
3. Fine-grained annotation of lung cancer-related genes and variations was achieved through data-based reasoning and gene reclassification.
4. According to feedback from experts, the lung cancer ontology knowledge base can provide knowledge services and decision support for epidemiological and clinical research.
5. We demonstrated how to construct a gene interaction network using the knowledge base and visualized the knowledge base information. The semantic network and gene interaction network of CASC8 provide a potential theoretical basis for the study of relevant mechanisms.
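
The ceRNA network principle referred to in the abstract links an lncRNA to an mRNA when both are paired with the same miRNA. The Python sketch below illustrates that join on toy data. It is not the dissertation's implementation: the function name build_cerna_triples and all identifiers such as lncRNA_A or miR_1 are hypothetical, and the real pairs would come from resources such as miRWalk and LncBase.

```python
from collections import defaultdict


def build_cerna_triples(lncrna_mirna_pairs, mirna_mrna_pairs):
    """Join lncRNA-miRNA pairs with miRNA-mRNA pairs on the shared miRNA.

    Returns a sorted list of (lncRNA, miRNA, mRNA) triples.
    """
    # Index the mRNA targets of each miRNA for fast lookup.
    targets_by_mirna = defaultdict(set)
    for mirna, mrna in mirna_mrna_pairs:
        targets_by_mirna[mirna].add(mrna)

    triples = set()
    for lncrna, mirna in lncrna_mirna_pairs:
        # A miRNA with no known mRNA target contributes no triple.
        for mrna in targets_by_mirna.get(mirna, ()):
            triples.add((lncrna, mirna, mrna))
    return sorted(triples)


# Hypothetical toy pairs, for illustration only.
lncrna_mirna = [("lncRNA_A", "miR_1"), ("lncRNA_A", "miR_2"), ("lncRNA_B", "miR_3")]
mirna_mrna = [("miR_1", "mRNA_X"), ("miR_1", "mRNA_Y"), ("miR_3", "mRNA_X")]

triples = build_cerna_triples(lncrna_mirna, mirna_mrna)
for lncrna, mirna, mrna in triples:
    print(lncrna, mirna, mrna)
```

In this toy input, miR_2 has no mRNA partner, so the pair (lncRNA_A, miR_2) produces no triple. The resulting triples are the kind of edge list that could be loaded into a network tool such as Cytoscape for drawing.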
Keywords/Search Tags: lung cancer, ontology, text mining, data mining, knowledge base, dependent syntax analysis, knowledge management, CASC8
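
As an illustration of the text mining step summarized in the abstract (dictionary-based entity recognition followed by relation screening within the scope of a sentence), the following Python sketch pairs entities of different types that occur in the same sentence. It is a simplified assumption of how such screening might look, not the dissertation's pipeline: the dictionaries, the type names and the functions find_entities and cooccurrence_pairs are hypothetical, and the sentence splitter is a naive regular expression rather than Stanford CoreNLP.

```python
import re
from itertools import combinations

# Hypothetical mini-dictionaries; the study built its dictionaries from
# sources such as UMLS and Entrez Gene.
DICTIONARIES = {
    "gene": {"EGFR", "KRAS"},
    "lung_cancer_type": {"lung adenocarcinoma"},
    "clinical_feature": {"prognosis"},
}


def find_entities(sentence):
    """Return sorted (term, type) pairs whose dictionary term occurs in the sentence."""
    lowered = sentence.lower()
    found = set()
    for entity_type, terms in DICTIONARIES.items():
        for term in terms:
            # Whole-word, case-insensitive dictionary match.
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered):
                found.add((term, entity_type))
    return sorted(found)


def cooccurrence_pairs(text):
    """Pair entities of different types that appear in the same sentence."""
    pairs = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        entities = find_entities(sentence)
        for (term_a, type_a), (term_b, type_b) in combinations(entities, 2):
            if type_a != type_b:
                pairs.append((term_a, term_b))
    return pairs


text = (
    "EGFR mutations are common in lung adenocarcinoma. "
    "KRAS status was linked to prognosis. "
    "Smoking remains the main risk factor."
)
pairs = cooccurrence_pairs(text)
print(pairs)
```

Sentence-level co-occurrence only proposes candidate pairs. Deciding which relation holds between two entities would need a further step, such as the dependency-based "entity-relational verb-entity" analysis the abstract describes.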