The Study On Driver Pathways Based On The Cancer Genomics Data

Posted on: 2019-10-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Gao
Full Text: PDF
GTID: 1360330572956692
Subject: Operational Research and Cybernetics
Abstract/Summary:
Several large-scale projects, such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), are sequencing the genomes of thousands of samples from dozens of cancer types and have generated a huge volume of cancer genomics data. A key challenge in interpreting these data is to distinguish the functional driver mutations that are important for cancer development from random passenger mutations that have no consequence for cancer. In addition, cancer is a disease of genes and pathways, and identifying cancer driver genes and pathways is an even more challenging problem. To ultimately determine whether a mutation is a driver or a passenger, its biological function should be tested. However, the current ability to detect somatic mutations far exceeds the ability to validate their function experimentally. Consequently, computational approaches that predict driver mutations are now an urgent priority. In this study, we focus mainly on how to apply classic combinatorial optimization strategies to identify cancer driver mutations, genes and pathways, which will benefit studies of the molecular mechanisms and pathogenesis underlying cancer as well as future cancer therapy.

The rapidly growing amount of cancer data provides an unprecedented opportunity for studying cancer. However, designing effective computational methods for analyzing these data has been a great challenge. The heterogeneity of cancer mutations significantly reduces the power of approaches that detect driver mutations and mutated genes by identifying recurrent mutations and recurrently mutated genes in a large cohort of cancer patients. Apart from the presence of passenger mutations in each cancer genome, one of the main biological explanations for this mutational heterogeneity is that driver mutations typically target groups of genes in cellular signaling and regulatory pathways: different mutations in different patients contribute to the dysfunction of the same driver pathways. Some methods have been developed to examine mutations and look for enrichment in the context of pathways or functional groups. The limitation of these methods is that knowledge of the pathways and functional groups they depend on remains incomplete.

According to the current understanding of the somatic mutational process of cancer, the somatic mutations in a driver pathway exhibit two combinatorial patterns: mutual exclusivity and high coverage. Driver pathway prediction algorithms based on these two patterns include combinatorial and probabilistic methods. However, the current combinatorial methods cannot ensure the exclusivity of the gene sets they identify, and the probabilistic methods face an efficiency bottleneck, so the applications of the current methods are quite limited in practice. In addition, gene sets that exhibit the combinatorial patterns but are identified from cancer mutation data alone may not be functionally related. Integrating more types of data may improve the prediction accuracy. For example, some methods identify recurrently mutated subnetworks in protein-protein interaction networks. However, it remains highly challenging to integrate multiple types of data to systematically identify pathways with the two patterns of mutual exclusivity and high coverage.

In this thesis, we propose a novel method, CovEx, which integrates cancer mutation data and protein-protein interaction networks to systematically identify driver pathways with mutual exclusivity and high coverage. The method proceeds as follows. We first construct an influence network using a random walk model on a protein-protein interaction network. Gene pairs joined by an edge in the influence network should have a strong topological correlation. To identify gene sets in driver pathways, the search is restricted to local influence networks, and the mutation dataset is studied in depth by searching a number of local networks independently. Then, considering the trade-off between the mutual exclusivity and coverage of gene sets, we design a two-step procedure to search for and screen gene sets. For each local network, we obtain candidate gene sets by solving a series of linear programs based on a linear combinatorial function. The candidate gene sets are then evaluated and selected by a newly designed non-linear evaluation function. The main drawback of the linear combinatorial function is that some identified gene sets are not exclusive: the function values of the gene sets are dominated by genes with high mutation frequencies. In contrast, the newly designed function ensures that each gene in a gene set contributes significantly to the function value, so the candidate gene sets can be evaluated effectively. Finally, because of the mutational heterogeneity across cancer patients, we apply the minimum set cover model to predict patient-specific driver gene sets and, further, the driver functional modules or pathways. Such individual predictions will be of great importance for the development of individualized cancer treatment.

We analyze a pan-cancer mutation dataset covering twelve cancer types, as well as the mutation dataset of each cancer type separately. For each dataset, we apply three protein-protein interaction networks from different databases. To further improve the prediction accuracy, we design a consensus method that analyzes the results for the three protein-protein interaction networks jointly. The consensus method corrects the results for each protein-protein interaction network and determines the consensus driver pathways across the different results. We evaluate the sensitivity and accuracy of the results obtained under different parameters and protein-protein interaction networks against benchmarks of cancer genes annotated by different databases. The accuracy of the results improves after correction by the consensus method. Furthermore, the sensitivity and accuracy of CovEx are shown to be much higher than those of the classical method HotNet2. We also predict driver pathways and GO functional modules by enrichment analysis against known pathways and GO functional modules.

CovEx still has the following shortcomings.
1) The function values of some identified gene sets are dominated by a few genes with high mutation frequencies, so the quality of these gene sets is hard to guarantee.
2) The current version of the software has not been parallelized. The linear programs solved in different local networks should be solved in parallel.
3) Because of the non-linearity of the newly designed function and the trade-off between the coverage and exclusivity of gene sets, the new function is used only to screen the gene sets identified with the linear function, rather than to identify gene sets directly. Therefore, important gene sets in some local networks may not be identified. Identifying gene sets based on the new function and analyzing the results of different methods may further increase the ability to identify driver pathways.

To address these shortcomings, we further introduce a method, UniCovEx, to systematically identify important mutually exclusive gene sets with balanced exclusive coverages in cancer. Compared with the general concept of mutual exclusivity of gene sets, the concept of a mutually exclusive gene set with balanced exclusive coverages helps to identify the real mutated gene sets in driver pathways. We propose the concept of exclusive entropy to evaluate the target gene sets and design corresponding algorithms. The experimental results show that UniCovEx can serve as an effective complement to CovEx. In addition, we design comCovEx to identify the common driver gene sets and pathways of different cancer types. Based on comCovEx, we further study the relationships among different cancer types.

The methods CovEx and UniCovEx have been implemented in C++. The software is freely available at https://sourceforge.net/projects/cancer-pathway/files/.
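
The two combinatorial patterns discussed in the abstract, coverage and mutual exclusivity, can be illustrated with a small Python sketch. This is not the CovEx or UniCovEx code: the abstract does not give the exact linear or non-linear functions, so the sketch assumes a simple "coverage minus overlap" score of the kind commonly used in the literature. The toy data and all function names (`coverage`, `overlap`, `linear_weight`, `exclusive_coverage`) are hypothetical.

```python
from typing import Dict, Iterable, Set

# Assumed data layout: gene -> set of patients with a mutation in that gene.
Mutations = Dict[str, Set[str]]


def coverage(genes: Iterable[str], mutations: Mutations) -> Set[str]:
    """Patients with a mutation in at least one gene of the set."""
    covered: Set[str] = set()
    for g in genes:
        covered |= mutations.get(g, set())
    return covered


def overlap(genes: Iterable[str], mutations: Mutations) -> int:
    """Surplus mutation count: 0 exactly when the set is mutually exclusive."""
    genes = list(genes)
    total = sum(len(mutations.get(g, set())) for g in genes)
    return total - len(coverage(genes, mutations))


def linear_weight(genes: Iterable[str], mutations: Mutations) -> int:
    """Assumed linear trade-off: coverage minus overlap (not the thesis's function)."""
    genes = list(genes)
    return len(coverage(genes, mutations)) - overlap(genes, mutations)


def exclusive_coverage(genes: Iterable[str], mutations: Mutations) -> Dict[str, int]:
    """Per gene: patients mutated in that gene and in no other gene of the set."""
    genes = list(genes)
    result: Dict[str, int] = {}
    for g in genes:
        others = coverage([h for h in genes if h != g], mutations)
        result[g] = len(mutations.get(g, set()) - others)
    return result


# Hypothetical toy cohort of nine patients: G1 is frequently mutated, G2 and G3 rarely.
toy: Mutations = {
    "G1": {"p1", "p2", "p3", "p4", "p5", "p6"},
    "G2": {"p7", "p8"},
    "G3": {"p6", "p9"},
}

if __name__ == "__main__":
    print(linear_weight(["G1", "G2"], toy))             # 8
    print(linear_weight(["G1", "G2", "G3"], toy))       # 8
    print(exclusive_coverage(["G1", "G2", "G3"], toy))  # {'G1': 5, 'G2': 2, 'G3': 1}
```

In this toy cohort, the exclusive set {G1, G2} and the non-exclusive set {G1, G2, G3} receive the same linear score, and most of that score comes from the frequently mutated G1. The per-gene exclusive coverages (5, 2, 1) make the imbalance visible, which is the kind of information a balance-aware evaluation could use.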
Keywords/Search Tags: cancer genomics, protein-protein interaction network, coverage, mutual exclusivity, driver pathway