| Predicting pathogenic genes and studying pathological mechanisms are of great significance. Accurate prediction of disease-causing genes is expected to deepen our understanding of disease mechanisms and to provide an important molecular basis for early detection, drug development, and disease prevention and treatment. To date, genome-wide association studies (GWAS) have discovered many risk genes, but these genes still cannot adequately explain disease phenotypes, which indicates that further genes remain to be discovered in complex diseases such as Alzheimer’s disease. This poses the challenge of developing new methods for disease gene prediction. In recent years, significant progress has been made in disease gene prediction based on functional gene networks. However, existing methods have limitations in terms of network accuracy, tissue specificity, and prediction methodology. To address these problems, this thesis focuses on developing methods for constructing functional gene networks and for predicting disease genes. The main work and innovations are as follows: (1) Considering the lack of tissue specificity of genomic data and the limited prediction accuracy of current tissue-specific networks, this thesis proposes a method for constructing tissue-specific functional gene networks (FGNs) by integrating tissue-specific genomic data via XGBoost, in order to improve network accuracy and, in turn, disease gene prediction accuracy. Mapping gene interactions within tissues and cell types plays a crucial role in understanding the genetic basis of human physiology and disease. Tissue FGNs are essential models for mapping complex gene interactions. A database of 49 human tissue/cell line FGNs, TissueNexus, constructed by integrating heterogeneous genomic data is presented. The heterogeneous input genomic data are integrated via XGBoost because Bayesian classifiers, which are the main approach used for constructing FGNs, cannot capture the interaction and nonlinearity of
genomic features well. A total of 1,341 RNA-seq datasets containing 52,087 samples are integrated across these networks. Because the tissue labels of RNA-seq data may be annotated under different names or be missing, intensive manual curation is performed to improve quality. A user-friendly database is further developed for network search, visualization, and functional analysis. The application of TissueNexus to prioritizing disease genes is illustrated. The database is available at https://www.diseaselinks.com/TissueNexus/. (2) NetWAS is a computational framework that predicts disease genes by combining FGNs and GWAS disease genes through machine learning approaches. NetWAS has proven successful in the field. However, it uses the networks without any preprocessing, which might limit the predictive power of its models. Here, we propose a method called SKPCA, which is based on a similarity ensemble of random subspaces, to improve disease gene prediction accuracy. SKPCA rests on the assumption that two samples (genes in this context) are more similar if they show similarity in multiple random subspaces. First, the method randomly extracts multiple feature subspaces from the original feature space. Second, for each subspace, SKPCA performs dimension reduction with Kernel PCA to remove high correlation between genes and to reduce noise. Third, for each subspace, the similarity between genes is calculated with Random Forests. Finally, the SKPCA gene similarity matrix is obtained by ensembling the per-subspace similarities. We test SKPCA on predicting genes for 12 diseases. The results show that SKPCA outperforms the benchmark methods. (3) We propose a graph convolutional network (GCN)-based approach for predicting disease-associated genes, called linear model-integrated GCN (LIMO-GCN). It integrates a linear model with a GCN to account for both linearity and nonlinearity in real-world data simultaneously. The reason for using a GCN is that it is naturally suited by design to network data and can exploit the
graph structure of gene networks. The motivation for integrating a linear model is that real-world data can theoretically be decomposed into the sum of a linear part and a nonlinear part, and that the linear part is best modeled by a linear model because a nonlinear model is biased and may easily overfit. The weighted sum of the predictions of the two components is used as the final prediction of LIMO-GCN. Applied to the prediction of disease genes, LIMO-GCN outperforms state-of-the-art approaches, including GCN and network-wide association studies, among others. Further, we show that the top-ranked genes are significantly associated with diseases based on molecular evidence from heterogeneous genomic data. Our results indicate that LIMO-GCN provides a novel method for prioritizing disease genes. (4) To address gene prediction for complex diseases associated with multiple tissues (such as obesity), a method based on multi-tissue gene network integration, SKPCA-LIMOGCN, is proposed. In this approach, disease-related tissues are first selected through literature analysis. Second, the SKPCA algorithm is used to compute the random-subspace ensemble similarity for each of the multiple tissue-specific functional gene networks, and the resulting similarity matrices are combined by weighted averaging. Third, the weighted-average similarity matrix is used as the adjacency matrix of the LIMO-GCN model, combined with multi-omics features, to predict disease genes. Obesity is a systemic disease associated with many tissues, and the prediction accuracy of existing methods is limited. Therefore, this thesis applies the proposed method to obesity-related gene prediction. Six relevant tissues were selected from the literature: adipose, brain, pancreas, liver, kidney, and blood. This thesis compares SKPCA-LIMOGCN with GenePlexus, RWRM, SKPCA, and LIMO-GCN, and the results show that SKPCA-LIMOGCN achieves higher prediction performance. The ablation analysis shows that the selected networks outperform the random
tissue-specific networks with the adipose-specific network, indicating the importance of disease-associated tissue-specific functional gene networks. The case analysis shows that the obesity genes predicted by SKPCA-LIMOGCN are associated with obesity. Finally, this thesis analyzes the tissue specificity of obesity genes and provides the tissue-specific genes associated with obesity. (5) In current complex disease research, a disease gene association analysis platform based on multi-omics and medical data would make it much easier for biological and medical experts to study candidate genes from different perspectives and to screen genes for testing. However, there is still a lack of disease gene visualization analysis platforms that integrate different omics data, clinical data, and analysis methods. For the complex diseases Alzheimer’s disease (AD) and atrial fibrillation (AF), this thesis collected relevant omics and medical data and constructed the disease gene visualization analysis platforms AlzCode and AFLink, respectively. First, AlzCode integrates a variety of functional genomic data, including protein interactions, gene regulatory networks, miRNA-target interaction networks, gene expression data, single-cell RNA-seq data, protein expression, and AD clinical data (such as CERAD score, Braak score, and dementia score). Moreover, the AlzCode platform integrates disease gene association evaluation and visualization analysis methods designed for the various data types, and supports the analysis of both individual genes and gene sets. AFLink integrates data such as AF-specific gene function networks, AF-related regulatory networks, gene and enhancer networks, phenotype associations, and drug targets. In terms of methods, AFLink provides different types of network visualization, visualization of differential expression analysis, and other methods, providing an integrated and easy-to-use platform for multi-perspective analysis and screening of AF
genes. The two platforms have important implications for understanding the genetics and pathological mechanisms of AD and AF. They are available at http://www.alzcode.xyz/ and https://diseaselinks.com/AFLink/. |
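
To illustrate the network integration step described in (4) and the prediction rule described in (3), the following is a minimal sketch, not the thesis implementation. It shows how per-tissue similarity matrices could be combined by weighted averaging into a single adjacency matrix, and how the outputs of a linear model and a GCN could be merged by a weighted sum. The function names, the toy two-gene matrices, the tissue weights, and the mixing weight `alpha` are all hypothetical. The symmetric normalization shown is the one commonly used with GCNs and is an assumption here; the abstract does not state which normalization is applied.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def weighted_average(matrices: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Element-wise weighted average of equally sized square matrices."""
    if not matrices or len(matrices) != len(weights):
        raise ValueError("need exactly one weight per matrix")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) / total for j in range(n)]
        for i in range(n)
    ]


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 (assumed, common for GCNs).

    Assumes non-negative similarities, so every degree is at least 1
    once the self-loop is added.
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / (sum(row) ** 0.5) for row in a_hat]
    return [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]


def combine_predictions(
    linear_scores: Sequence[float], gcn_scores: Sequence[float], alpha: float
) -> List[float]:
    """Weighted sum of the two components: alpha * linear + (1 - alpha) * GCN."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * lin + (1.0 - alpha) * gcn for lin, gcn in zip(linear_scores, gcn_scores)]


# Toy example with two genes and two tissues (all values are made up).
adipose = [[0.0, 0.8], [0.8, 0.0]]
liver = [[0.0, 0.2], [0.2, 0.0]]

# Hypothetical tissue weights: adipose counts three times as much as liver.
adjacency = weighted_average([adipose, liver], weights=[3.0, 1.0])
normalized = normalize_adjacency(adjacency)

# Hypothetical per-gene scores from a linear model and from a GCN.
final_scores = combine_predictions([0.8, 0.0], [0.4, 0.8], alpha=0.25)
```

In this sketch the tissue weights and `alpha` are fixed by hand. In practice they would presumably be tuned or learned, which the abstract does not specify.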