Identification Of Cancer Driver Genes Based On Machine Learning

Posted on: 2022-05-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Xu
GTID: 1524306818977409
Subject: Control theory and control engineering
Abstract/Summary:
With its high morbidity and mortality, cancer is a major disease affecting human health. Progress in modern biotechnology has made it possible to study the pathogenesis of cancer at the genetic and molecular levels. Driver genes give cells a selective growth advantage and play significant roles in promoting the occurrence and development of cancer. Identifying driver genes is therefore of great significance for cancer diagnosis, drug development, prognosis, and precision medicine. However, identifying driver genes through biological experiments is expensive and time-consuming. With the development of second-generation sequencing technology, genome projects such as The Cancer Genome Atlas (TCGA) have provided researchers with a large amount of gene sequencing data from cancer samples. Analyzing these data and identifying driver genes through computational methods can narrow the range of candidate driver genes and provide a valuable reference for further experimental verification and clinical research. Because driver genes differ among patients with the same cancer type, and because gene mutation data have few samples and high dimensionality, computational identification of driver genes faces considerable challenges. This dissertation uses machine learning methods to analyze sequencing data and identify cancer driver genes. The main work includes the following aspects:

1. To overcome the instability of algorithms that estimate the background distribution by random sampling, an algorithm based on a neural network model is proposed to identify driver genes carrying mutations with a functional impact on proteins. First, a BP neural network is used to model the nonlinear relationship between genetic characteristics and the functional impact scores of genes, and to predict those scores. Second, genes are grouped by hierarchical clustering based on their genetic characteristics, and maximum likelihood estimation is used to fit a Gamma distribution in each cluster, which gives the background distribution of functional impact scores for that cluster. Finally, driver genes are identified with a significance test against the background distribution. The average deleterious mutation ratio of the driver genes identified on 31 datasets was 0.8368. Across the 31 TCGA cancer mutation datasets, the average accuracy of the algorithm measured against the Cancer Gene Census (CGC) and the Network of Cancer Genes (NCG) is 55.62% and 86.85%, respectively. Experimental evaluations show that the proposed algorithm performs better than 21 other driver gene identification algorithms.

2. Genes are grouped into signalling pathways through their interactions. Building on the above work on individual driver genes, a robust adaptive driver gene set identification algorithm is proposed to identify sets of genes that jointly promote the development of cancer. The algorithm addresses the problem that enforcing strong mutual exclusivity leads to unbalanced mutation patterns within a gene set. Analysis of the mutation patterns of cancer signalling pathways verifies that, within a gene set, gene coverage is highly positively correlated with overlap contribution; that is, genes with high mutation frequency tend to be co-mutated with other genes. Appropriate overlaps should therefore be allowed when identifying driver gene sets. An adaptive weight that is negatively correlated with mutation frequency is introduced to regulate the mutual exclusivity of genes with different mutation frequencies. In addition, a leave-one-out subsampling strategy is combined with a genetic algorithm to construct a robust optimization model. Experimental results on three cancer mutation datasets show that the driver gene sets identified by the proposed algorithm achieve high coverage while preserving mutual exclusivity, and that they are enriched in critical cancer signalling pathways such as ErbB, MAPK, and PI3K-Akt. Perturbation experiments on the lung adenocarcinoma dataset show that the proposed algorithm resists data disturbance better than four other similar algorithms: it identified the same driver gene set with frequencies of 75% and 81% when ten zeros were replaced by ones and when ten ones were replaced by zeros, respectively, in the lung adenocarcinoma mutation matrix.

3. For mutation data with a small number of samples, a mutual exclusivity weight based on the cardinality of mutated patients may be biased. Therefore, building on the robust adaptive model above, an algorithm based on multi-omics analysis is proposed to identify driver gene sets. Gene expression levels are added to the mathematical programming model so that, together with the analysis of mutation frequency factors, they regulate the mutual exclusivity of gene mutations. Driver gene sets are thus identified by combining genomic and transcriptomic information. Experimental results on the lung adenocarcinoma dataset show that the identified driver gene sets achieve high coverage and mutual exclusivity and are enriched in the ErbB, MAPK, and non-small cell lung cancer signalling pathways. In addition, an algorithm based on information entropy is proposed to remove the interference that irrelevant genes in the original mutation data cause for driver gene set identification. The most valuable set of mutation categories is identified by minimizing the information entropy of gene mutations, and the candidate gene set and the corresponding mutation matrix are then derived from these categories. On the ovarian cancer mutation dataset, the algorithm selected the five most valuable mutation categories and reduced the number of candidate genes from 9901 to 471, while retaining genes with both high and low mutation frequency.

4. The above driver gene identification algorithms can provide references for targeted drug therapy of cancer. However, cancer patients differ in their sensitivity to anticancer drugs, and selecting gene features related to anticancer drug response faces the curse of dimensionality. For this reason, an algorithm based on an autoencoder network is further proposed to identify driver genes related to anticancer drug response. An autoencoder network is built to evaluate the contribution of gene features through its weights and to perform a preliminary reduction of the feature dimensionality. The Boruta algorithm is then used to select driver genes that significantly affect the drug sensitivity of cell lines. In addition, the EasyEnsemble sampling method is applied to the imbalanced datasets for feature integration, making full use of the information in the class with more samples. Finally, a random forest classifier is used to predict the drug sensitivity of cell lines based on the selected driver genes. Experimental results for the lung cancer targeted drugs PLX4720 and BIBW2992 show that the algorithm identifies driver genes associated with lung cancer and with the targeted signalling pathways. The algorithm obtains an average area under the curve (AUC) of 0.7116 and 0.8210 on the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) databases, respectively, which is better than four other similar algorithms.
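The coverage and mutual exclusivity terms used in aspects 2 and 3 can be illustrated with a minimal Python sketch. It computes, for a candidate gene set, the coverage (patients with at least one mutated gene in the set), the coverage overlap (extra mutations beyond the first per patient), and the plain coverage-minus-overlap weight that is a common baseline in this line of work. This is not the adaptive weight proposed in the dissertation, whose exact form is not given in the abstract. The function names, the dictionary layout, and the toy data (`geneA`, `p1`, and so on) are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def coverage_and_overlap(
    mutations: Dict[str, Set[str]], gene_set: List[str]
) -> Tuple[int, int]:
    """Return (coverage, overlap) of gene_set.

    mutations maps a gene name to the set of patient IDs in which it is mutated.
    coverage = number of patients with at least one mutated gene in the set.
    overlap  = sum of per-gene patient counts minus coverage, i.e. how many
               mutations fall on patients that are already covered.
    """
    covered: Set[str] = set()
    total = 0
    for gene in gene_set:
        patients = mutations.get(gene, set())
        covered |= patients
        total += len(patients)
    coverage = len(covered)
    return coverage, total - coverage


def exclusivity_weight(mutations: Dict[str, Set[str]], gene_set: List[str]) -> int:
    """Baseline weight: coverage minus overlap (higher is better).

    Assumption: this unweighted form stands in for the dissertation's
    adaptive weight, which is not specified in the abstract.
    """
    coverage, overlap = coverage_and_overlap(mutations, gene_set)
    return coverage - overlap


# Hypothetical toy data: three genes, six patients.
toy_mutations = {
    "geneA": {"p1", "p2", "p3"},
    "geneB": {"p4", "p5"},
    "geneC": {"p1", "p4", "p6"},
}

print(coverage_and_overlap(toy_mutations, ["geneA", "geneB"]))         # (5, 0)
print(coverage_and_overlap(toy_mutations, ["geneA", "geneB", "geneC"]))  # (6, 2)
print(exclusivity_weight(toy_mutations, ["geneA", "geneB", "geneC"]))    # 4
```

In the toy data, adding `geneC` raises coverage from 5 to 6 but introduces two overlapping mutations, so the baseline weight drops from 5 to 4. This is the kind of penalty on frequently mutated genes that, according to the abstract, the adaptive weight is meant to relax.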
Keywords/Search Tags: Bioinformatics, Cancer driver gene, Multi-omics analysis, Machine learning, Autoencoder network