Cancer is a complex disease that poses a serious threat to human life. Breakthroughs in high-throughput sequencing technology have reduced the cost of research on cancer diagnosis, clinical treatment, and prognosis prediction. Integrating multi-omics and high-throughput data makes systematic, comprehensive analysis of cancer possible and allows the process of carcinogenesis to be studied in greater depth. From a genetic perspective, cancer arises from the continuous selection and accumulation of genetic mutations. The integration of multi-omics data to mine cancer-related genes and cancer driver genes has therefore become a hot spot in the study of cancer pathogenesis. This paper proposes two methods for identifying driver genes:

(1) A method based on overlapping community detection (GCommunity) is proposed to mine gene communities with overlapping characteristics and identify cancer-related driver genes. First, EMDomics is used to analyze differential expression in highly heterogeneous cancer data, and the genes with significant differential expression are selected as input genes. Then, a Gibbs sampler is used to construct a gene interaction network from the gene expression data, and protein-protein interaction (PPI) data are added to complete the network. An overlapping community detection algorithm is applied to mine the final gene communities. Candidate driver genes are selected by calculating the frequency of copy number variation. A regression tree model is then applied to establish the regulatory relationship between the candidate driver genes and the gene communities, yielding the cancer driver genes. In summary, GCommunity obtains gene interactions from genomic and proteomic data, analyzes gene mutation behavior from copy number variation data, and establishes the regulatory relationship between mutated genes and gene communities with a probabilistic statistical model. The experimental results show that GCommunity can mine high-quality, biologically meaningful gene communities and that the identified driver genes have driving significance.

(2) A somatic mutation-based method for detecting cancer driver genes (MaxSIF) is proposed, which integrates gene expression data, protein-protein interaction data, and somatic mutation data. First, the method uses a correction factor to remove the background noise of silent mutations. The mutation score is then calculated from the proportions of nonsense mutations, missense mutations, frame-shift indels, and in-frame deletions in the nucleotide sequence. The gene interaction network built from the expression data and the protein-protein interaction data is combined with the mutation scores. A mutation influence score is calculated between each gene and its neighboring nodes, and the maximum value is taken as the mutation influence score of the gene. Finally, genes with a high SIF are selected as driver genes. The motivation for MaxSIF is that two genes that both have high mutation scores and are close to each other in the gene network should have strong mutation effects. The method therefore takes into account the mutational effects of all neighbors in the gene network when calculating the maximum mutational effect of a gene. The experimental results show that the driver genes identified by MaxSIF are significantly enriched in cancer pathways, and that the method can correctly identify driver genes and distinguish oncogenes from tumor suppressor genes.
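
The neighbor-maximum step of MaxSIF can be illustrated with a minimal Python sketch. The abstract does not give the exact scoring formula, so the pairwise influence used here (the product of the two mutation scores and an edge closeness weight) is an assumption made only for illustration. The function names, gene labels, and numeric values are likewise hypothetical and are not taken from the experiments.

```python
from typing import Dict, List, Tuple


def max_influence_scores(
    mutation_score: Dict[str, float],
    edges: Dict[Tuple[str, str], float],
) -> Dict[str, float]:
    """Return, for each gene, the largest pairwise influence over its neighbors.

    mutation_score: gene -> mutation score (assumed already noise-corrected).
    edges: (gene_a, gene_b) -> closeness weight of an undirected network edge.
    """
    best = {gene: 0.0 for gene in mutation_score}
    for (a, b), weight in edges.items():
        if a not in mutation_score or b not in mutation_score:
            continue  # skip edges that touch genes without a mutation score
        # ASSUMPTION: influence grows with both mutation scores and with
        # closeness in the network; the real formula may differ.
        influence = mutation_score[a] * mutation_score[b] * weight
        best[a] = max(best[a], influence)
        best[b] = max(best[b], influence)
    return best


def rank_genes(scores: Dict[str, float], top_k: int) -> List[str]:
    """Return the top_k genes by score, breaking ties alphabetically."""
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_k]


if __name__ == "__main__":
    # Hypothetical toy network with three genes.
    toy_scores = {"G1": 0.9, "G2": 0.8, "G3": 0.1}
    toy_edges = {("G1", "G2"): 0.5, ("G2", "G3"): 1.0}
    influence = max_influence_scores(toy_scores, toy_edges)
    print(rank_genes(influence, top_k=2))
```

In this toy network, a gene with a low mutation score of its own can only rank highly if it sits next to a strongly mutated neighbor, which mirrors the stated motivation that mutation effects depend on both score and network proximity.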