
Research On Identification Of Pan-cancer Common Driver Modules Based On Imputation Data

Posted on: 2022-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: C Wu
Full Text: PDF
GTID: 2504306770471894
Subject: Automation Technology
Abstract/Summary:
Different cancer types share certain commonalities, which are reflected in the possibility that they share some important pathways. A mutation in any driver gene of such a pathway can inactivate the pathway and lead to the development and progression of cancer, which reveals a common oncogenic cause of different cancers at the molecular level. Identifying common driver pathways among different cancers is therefore crucial for designing effective and generalized cancer treatments, and it has been made possible by the generation and accumulation of massive omics data. On the one hand, analysis of pan-cancer data reveals similarities between the genomic profiles of different cancers. On the other hand, integrating multiple types of cancer omics data, such as genomics and proteomics, reduces the noise and incompleteness of single-omics data. Because proteomics data are included, interactions between proteins are taken into account, so a driver module is another description of a driver pathway, based on protein interaction. The main work of this thesis is as follows.

In terms of data processing, pan-cancer data sets are often built from the intersection of the gene sets of different cancer samples before driver modules are identified, so some genes that are critical for cancer development may be overlooked. To address this problem, a K-nearest-neighbors-based imputation algorithm, KNNImp, is first devised to infer the variation values of potentially significant missing genes. The algorithm first selects the gene to be imputed, then calculates the similarity between samples, and judges whether the gene is mutated in a sample from the mutation status of that gene in the top K most similar samples. Finally, it selects the samples to be imputed preferentially from all possibly mutated samples and predicts their mutation status in the imputed gene. Using real biological data, the driver module identification algorithm is run on the original data set and on the data set after imputation, and the identified driver modules are compared and analyzed. The experimental results on biological data show that the presented imputation algorithm does help recover some important cancer-related genes and can improve the identification efficiency of the module identification algorithm.

In terms of module identification, this thesis considers the differences in mutation frequency among cancers when identifying common driver modules in pan-cancer data. Mutual exclusivity and coverage are defined based on the harmonic mean, the goal is to identify a set of non-overlapping modules, and a PCDMSS (pan-cancer common driver module set score) model that maximizes this score is proposed. For this model, HMCEwalk, a hierarchical-clustering-based method for identifying common driver modules in pan-cancer data, is proposed. It weights the integrated PPI network with the harmonic mean of gene coverage scores and mutual exclusivity scores across the cancer types, and extracts modules through a random walk process. Experiments were carried out on both simulated data and real biological data. The results on simulated data indicate that, given two types of cancer, HMCEwalk has a stronger tendency to identify a set of modules that not only are mutated in a large proportion of the samples of these cancers but also have a similar proportion of mutated samples in each cancer. The results on biological data indicate that, compared with two state-of-the-art computational methods, MEXCOwalk and DriveWays, the proposed method exhibits competitive performance in most cases in terms of recovering known cancer genes and producing modules with satisfactory coverage and mutual exclusivity for each cancer. Many of the identified modules are involved in known cancer-related biological pathways. In addition, the proposed method recognizes many cancer-related genes missed by MEXCOwalk and DriveWays.

In summary, this thesis studies the problem of identifying pan-cancer common driver modules: the gene imputation algorithm KNNImp is proposed, the pan-cancer common driver module identification model is redefined, and the pan-cancer driver module identification method HMCEwalk is proposed to solve the problem.
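
As a rough illustration of two ideas described in the abstract (neighbor-based imputation of a missing gene, and a harmonic-mean coverage score across cancer types), the following Python sketch may help. It is not the thesis's KNNImp or HMCEwalk implementation. The Jaccard similarity, the majority vote, the default K, the coverage definition, and all function names are assumptions chosen only for illustration; the abstract does not specify them.

```python
from statistics import harmonic_mean


def jaccard(a, b):
    """Jaccard similarity between two sets of mutated genes (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_impute(samples, target, gene, k=3):
    """Guess whether `gene` is mutated in a sample that lacks data for it.

    samples: dict mapping sample id -> set of mutated genes, for samples
             in which `gene` was actually profiled.
    target:  set of mutated genes observed in the sample missing `gene`.
    Returns True if more than half of the k most similar samples carry a
    mutation in `gene` (the majority vote is an assumption).
    """
    # Rank reference samples by similarity on all genes except the target gene.
    ranked = sorted(
        samples.values(),
        key=lambda genes: jaccard(target, genes - {gene}),
        reverse=True,
    )
    neighbours = ranked[:k]
    votes = sum(gene in genes for genes in neighbours)
    return votes * 2 > len(neighbours)


def coverage(module, samples):
    """Fraction of samples with at least one mutated gene in `module`."""
    hit = sum(1 for genes in samples.values() if genes & module)
    return hit / len(samples)


def pan_cancer_coverage(module, cohorts):
    """Harmonic mean of per-cancer coverage.

    cohorts: dict mapping cancer type -> {sample id -> set of mutated genes}.
    The harmonic mean stays low when any single cancer is poorly covered,
    which favors modules mutated in a similar share of samples per cancer.
    """
    return harmonic_mean([coverage(module, s) for s in cohorts.values()])


if __name__ == "__main__":
    reference = {
        "s1": {"TP53", "KRAS"},
        "s2": {"TP53", "KRAS", "EGFR"},
        "s3": {"BRAF"},
        "s4": {"TP53", "EGFR"},
    }
    print(knn_impute(reference, {"TP53", "KRAS"}, "EGFR", k=3))

    cohorts = {
        "A": {"a1": {"TP53"}, "a2": {"KRAS"}},
        "B": {"b1": {"TP53"}, "b2": {"BRAF"}},
    }
    print(pan_cancer_coverage({"TP53", "KRAS"}, cohorts))
```

In this toy data the module covers every sample of cancer A but only half of cancer B, so the harmonic mean falls below the arithmetic mean of the two coverages. That is the kind of imbalance the abstract says HMCEwalk is designed to penalize.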
Keywords/Search Tags: Pan-cancer data, Common driver module, Data imputation, Hierarchical clustering