
Research on eQTL Data Processing and Functional Analysis Methods

Posted on: 2021-07-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Wang
GTID: 1480306569983419
Subject: Computer application technology
Abstract/Summary:
With the rapid development of genome sequencing technology and the decline in sequencing costs, genome sequencing projects for large populations have been launched continuously, and the number of discovered genomic variants has grown at an unprecedented rate. Deciphering the molecular function of genomic variation is an urgent problem in the functional genomics era, and solving it matters for understanding the molecular mechanisms of diseases and other important traits and for discovering drug targets. Expression quantitative trait loci (eQTL) analysis has become an important way to address this problem. By analyzing the relationship between genomic variation and gene expression levels across a large number of samples, it makes it possible to investigate the functional impact of genomic variants on gene expression at genome-wide scale. However, eQTL data are characterized by high noise, incompleteness, and complex biological functional relationships, and existing eQTL analysis methods are not effective enough at mining and revealing the important biological functions the data contain. More effective data processing and functional analysis methods are urgently needed. This thesis focuses on eQTL data processing and functional analysis methods. The main contents include the following aspects:

First, considering the characteristics of raw eQTL data, such as multi-source heterogeneity, the prevalence of abnormal samples, and high noise caused by confounding factors, we systematically built a set of quality analysis and cleaning methods for raw eQTL data by integrating multiple statistical models and machine learning methods. The set covers data quality analysis, abnormal data detection, confounding factor adjustment, and data normalization. It addresses quality control and standardization of multi-source heterogeneous raw eQTL data, reduces manual intervention, reduces data noise, and improves data quality.

Second, considering the prevalence of missing statistics in eQTL summary datasets, this thesis proposes a method for inferring missing eQTL data based on the multivariate Gaussian distribution. The method models eQTL statistics with a multivariate Gaussian distribution based on the linkage disequilibrium relationships among DNA variants and infers missing eQTL statistics from known ones. By partitioning the genome into segments and constructing linkage disequilibrium matrices dynamically, the method is suitable for parallel analysis and improves the efficiency of inferring missing data. It requires neither raw genotype data nor raw gene expression data, and it is suitable for inferring missing data for multiple QTL types. It can also infer statistics for variants below the minor allele frequency threshold applied in the original eQTL analysis, which increases the number of eQTL signals discovered. The method helps improve the completeness of eQTL summary data and further enhances biological discovery in eQTL studies.

Third, to mine regulatory patterns between eQTLs and multiple genes, this thesis proposes a method for eQTL regulatory network construction and motif mining. The method uses an adaptive permutation testing strategy and fits the null distribution of eQTL mediation effects with a generalized Pareto distribution to calculate eQTL mediation effects quickly and accurately. It then integrates the eQTL mediation effects with the eQTL bipartite network and a gene regulatory network based on mutual information to construct the eQTL regulatory network. On this basis, the thesis applies a network motif mining method for eQTL regulatory networks based on heuristic subgraph isomorphism discrimination, which reduces the search space of subgraph isomorphism discrimination, improves the computational efficiency of network motif mining, and discovers regulatory patterns between eQTLs and multiple genes. The method broadens the study of traditional gene regulatory patterns and provides new methodological support for research on the regulatory functions between eQTLs and genes.

Fourth, to mine disease modules from biomolecular networks integrated from multi-omics data, this thesis proposes a method for mining disease functional modules by integrating eQTL networks and multi-omics data. The method first uses the eQTL network as a bridge to integrate association data between disease phenotypes and genomic variation (GWAS), association data between transcriptome gene expression and genomic variation (eQTL), and protein-protein interaction (PPI) data, constructing a complex molecular network that integrates multi-omics information. On this basis, a graph representation learning model extracts the features of the network, hierarchical clustering detects functional modules in an unbiased way, and the enrichment level of disease genes in each module is used to prioritize the modules. The method helps in studying disease molecular pathways, and the functional impact of eQTLs within them, from the perspective of systems biology.
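
The abstract does not give the formulas behind the second contribution. As a rough illustration only, the general idea of inferring missing summary statistics from a multivariate Gaussian model can be sketched with the standard conditional-mean formula: if z-scores within a segment follow a Gaussian whose covariance is the linkage disequilibrium (LD) matrix, the missing z-scores are estimated as Σ_mo Σ_oo⁻¹ z_o. The function name `impute_z`, its arguments, and the `ridge` regularization term are assumptions made for this sketch and are not taken from the thesis.

```python
import numpy as np

def impute_z(z_obs, ld, obs_idx, mis_idx, ridge=0.1):
    """Conditional-mean imputation of z-scores within one genomic segment.

    z_obs   : z-scores of the observed variants (order matches obs_idx)
    ld      : LD correlation matrix over all variants in the segment
    obs_idx : indices of observed variants in `ld`
    mis_idx : indices of variants whose statistics are missing
    ridge   : hypothetical regularization added to the diagonal of the
              observed block to keep the linear solve stable
    """
    ld = np.asarray(ld, dtype=float)
    z_obs = np.asarray(z_obs, dtype=float)

    # Covariance blocks: observed-observed and missing-observed.
    s_oo = ld[np.ix_(obs_idx, obs_idx)] + ridge * np.eye(len(obs_idx))
    s_mo = ld[np.ix_(mis_idx, obs_idx)]

    # weights = S_mo @ inv(S_oo), computed with a solve instead of an inverse
    # (valid because S_oo is symmetric).
    weights = np.linalg.solve(s_oo, s_mo.T).T

    z_hat = weights @ z_obs
    # Variance explained by the observed variants, usable as a quality score.
    r2_pred = np.einsum("ij,ij->i", weights, s_mo)
    return z_hat, r2_pred
```

Because each segment is handled independently, a function of this kind could be applied to segments in parallel, which matches the parallel analysis the abstract describes; how the thesis actually partitions the genome and builds its LD matrices is not stated here.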
Keywords/Search Tags: Expression quantitative trait loci (eQTL) analysis, Biological network analysis, Gene expression data, Genotype data, Data quality control