Research On Semantic Query Method Of Genome Variation Data Based On Ontology

Posted on: 2020-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: S H Su
GTID: 2370330590974456
Subject: Computer Science and Technology
Abstract/Summary:
With the development of human DNA sequencing technology and the progress of large-scale sequencing projects such as the 1000 Genomes Project, biomedical data has grown explosively, and genome variation data sets now reach the terabyte (TB) or even petabyte (PB) level. Large-scale genomic variation data provides a foundation for biomedical research, but it also brings challenges in the storage, processing, and analysis of big data. Traditional databases have certain advantages in processing small data sets, but they are not suited to storing and querying genomic variation data above the TB level. HBase and Spark have attracted extensive attention from academia and industry in large-scale data processing, owing to HBase's dynamically scalable storage and Spark's efficient parallel data processing. Faced with the growing mass of genome variation data, how to store and manage it efficiently and scalably, how to query and analyze it, and how to uncover the biomedical knowledge and patterns it contains remain difficult problems in current research.

Because disease similarity can be used to intuitively and quantitatively measure the correlation between diseases, methods for evaluating disease similarity, and semantic query methods that incorporate disease similarity, have become a research focus. To effectively measure similarity for newly discovered diseases or diseases with little genetic information, this thesis proposes a rule-based similarity calculation method based on the Disease Ontology (DO). The method jointly considers the influence of disease-related genes and phenotypes on the similarity measure, and experiments show that it achieves good performance in terms of ROC. In addition, to find similar diseases effectively on highly imbalanced data sets, this thesis proposes a disease similarity calculation method based on a deep neural network (DNN). Experiments show that this method achieves good results in terms of both ROC and PRC.

This thesis also proposes storage and query methods for genome variation data based on Spark and HBase, and builds a non-primary-key index mechanism and a query optimization method based on Lucene. Furthermore, it proposes a semantic query method for massive genome variation data based on a disease similarity network, which is constructed using the disease similarity measures proposed in this thesis. Experimental results show that, compared with the storage and query processing methods of traditional databases, the proposed storage and query methods have clear advantages on large-scale genomic variation data.
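The abstract does not give implementation details, so the following is only a minimal in-memory sketch of the general idea behind a non-primary-key index combined with a similarity-based semantic query. It is not the thesis's method: plain Python dictionaries stand in for HBase and Lucene, and every name (`variants`, `build_secondary_index`, `semantic_query`, the `threshold` parameter) and every value (row keys, gene and disease labels, similarity scores) is a hypothetical placeholder chosen for illustration.

```python
from collections import defaultdict

# Toy primary store: row key -> variant record (a stand-in for an HBase table).
# All keys, fields, and values are made-up placeholders, not real data.
variants = {
    "chr1:1000:A>G": {"gene": "GENE1", "disease": "disease_A"},
    "chr2:2000:C>T": {"gene": "GENE2", "disease": "disease_B"},
    "chr3:3000:G>A": {"gene": "GENE3", "disease": "disease_C"},
}


def build_secondary_index(store, field):
    """Map each value of a non-key field to the row keys that carry it."""
    index = defaultdict(set)
    for row_key, record in store.items():
        index[record[field]].add(row_key)
    return index


# Toy disease similarity network: disease -> {neighbour: similarity score}.
# Scores are arbitrary placeholders, not results from the thesis.
similarity = {
    "disease_A": {"disease_B": 0.8, "disease_C": 0.2},
    "disease_B": {"disease_A": 0.8},
    "disease_C": {"disease_A": 0.2},
}


def semantic_query(disease, index, network, threshold=0.5):
    """Return row keys for the disease and for neighbours scoring >= threshold."""
    targets = {disease}
    targets.update(
        other for other, score in network.get(disease, {}).items()
        if score >= threshold
    )
    row_keys = set()
    for target in targets:
        # .get avoids creating empty entries in the defaultdict index.
        row_keys |= index.get(target, set())
    return sorted(row_keys)


disease_index = build_secondary_index(variants, "disease")
result = semantic_query("disease_A", disease_index, similarity)
print(result)
```

In this sketch, the secondary index avoids scanning every record when filtering on a non-key field, and the similarity threshold controls how far the query is widened to related diseases. A real deployment would replace the dictionaries with the distributed store and index named in the abstract.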
Keywords/Search Tags: genomic variation data, disease similarity, ontology, semantic query, big data, query optimization