For different tumor subtypes, specific and precise clinical treatment can effectively improve clinical prognosis. Accurately identifying tumor subtypes and understanding the immune escape mechanisms that dominate in each subtype are important preparations for precision medicine. However, epistasis among gene and structural mutations across the tumor genome, together with the heterogeneity of tumor samples, poses severe challenges for tumor data mining. The results of traditional association studies suffer from shortcomings such as false positives, poor interpretability, and missing heritability, and the tumor subtypes identified by computational methods differ considerably from clinical tissue typing. This study argues that, building on in-depth mining of whole-genome SNP data, integrating other levels of omics data helps to understand the process of tumor formation systematically and completely; that constructing a biomolecular interaction network and identifying its key subnetwork features helps improve the performance of tumor subtype classification or risk assessment models; and that identifying important genes with significant expression differences between subtypes, and understanding the regulatory mechanisms behind their expression variation, can support subsequent precision immunotherapy. This study therefore comprises the following four parts.

(1) A method for epistasis and heterogeneity analysis based on a maximum correlation-maximum consistency criterion is proposed. First, for genome single nucleotide polymorphism (SNP) data, a maximum correlation-maximum consistency criterion is designed from the Bayesian network K2 score and information entropy to evaluate genome epistasis comprehensively. As the number of SNPs increases, the combination space grows sharply, leading to combinatorial
explosion. An improved genetic algorithm is therefore proposed to search the SNP epistasis space heuristically and determine multiple potential susceptibility epistatic combinations. Notably, different epistatic combinations correspond to different pathogenic gene combinations, which may lead to different tumor subtypes, that is, heterogeneity. Finally, an XGBoost classifier is trained on characteristic SNPs that contain the susceptibility sites from multiple epistatic combinations, in order to test the hypothesis that accounting for tumor heterogeneity helps improve the accuracy of tumor subtype prediction. To demonstrate the effectiveness of the method, both epistasis detection and the accuracy of tumor subtype classification were evaluated. Extensive simulation results show that the method is more effective and achieves better prediction accuracy than previous methods.

(2) A collaborative representation method based on a kernelized convex hull is proposed for tumor diagnosis. Most early tumor diagnosis methods based on high-dimensional, small-sample microarray data face challenges such as over-fitting and high false-negative rates. The proposed method first constructs the test sample as a special convex hull containing only one element, and then uses a training data set containing samples of different categories to collaboratively represent this convex hull. In addition, the kernel method is used to handle the high dimensionality, non-linear separability, and related problems of tumor samples. This study compares 11 different tumor sample classification algorithms on 11 public tumor expression profile data sets. The experimental results show that the proposed method has certain advantages in accuracy and computational efficiency.

(3) A multi-source data fusion framework is proposed to reveal the regulatory mechanism of breast cancer immune escape:
Identifying breast cancer subtypes based on immune-related genes helps to understand the immune escape pathways that dominate in different subtypes, so that effective treatment measures can be implemented for each subtype. To this end, this study applies non-negative matrix factorization and a consensus clustering algorithm to The Cancer Genome Atlas (TCGA) RNA-seq breast cancer data, and identifies four important subtypes based on a priori immune-related genes. Next, breast cancer samples from TCGA and normal tissues from non-cancer individuals in the Genotype-Tissue Expression (GTEx) database are subjected to differential expression analysis to identify important immune genes related to the immune subtypes. A correlation analysis between copy number variation (CNV) and immune gene mRNA is then carried out, and the regulatory mechanisms of immune genes whose expression cannot be explained by CNV are studied using ATAC-seq data. The experimental results show that CDH1 and PVRL2 play an important role in the immune evasion pathways of all four subtypes; the expression variation of CDH1 is mainly caused by its own CNV, whereas the expression variation of PVRL2 is more likely to be regulated by transcription factors. Finally, the composition of infiltrating immune cells is estimated for the clusters of different immune subtypes, and the differences in immune escape mechanisms between clusters are compared.

(4) A gene expression trait-CNV association analysis method based on gene interaction network clustering and group sparse learning is proposed. Current methods for interpreting gene expression variation face several major challenges, such as low explanatory power, insufficient prediction accuracy, and lack of biological significance. This study proposes a new computational method to explain in depth the causes of differences in the mRNA expression of breast cancer susceptibility genes from the perspective of genomic structural variation. First, some high-risk genes related to
breast cancer are collected, and a ranking-based strategy is designed to preprocess the copy number variation (CNV) and mRNA data of breast cancer. Second, to enrich the biological significance of the method and avoid an explosion of the combinatorial space, an a priori gene interaction network is introduced and a network clustering algorithm is applied to identify high-density subnetworks. Finally, to describe the relationship between the feature subnetworks and the mRNA expression of the target gene, a group sparse learning model is proposed to explain the relationship between CNVs and the expression differences of pathogenic genes. The experimental results show that the method significantly improves the accuracy of predicting target gene expression abundance, and pathway enrichment analysis further confirms that the associated CNV genes are also related to the occurrence and development of breast cancer.
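
The group sparse learning step in part (4) can be pictured with a standard group lasso, in which CNV features are partitioned into groups and whole groups are either kept or dropped when predicting the expression of a target gene. The abstract does not specify the exact model, so the sketch below is only an assumed stand-in: the function names, the unweighted penalty, the proximal gradient solver, the hard-coded groups (which in the study would come from network clustering), and the synthetic data are all illustrative choices, not the study's implementation.

```python
import numpy as np

def group_soft_threshold(w, groups, tau):
    """Proximal operator of tau * sum_g ||w_g||_2 (block-wise shrinkage)."""
    out = np.zeros_like(w)
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        if norm > tau:
            # Shrink the whole block toward zero; blocks with norm <= tau stay zero.
            out[idx] = (1.0 - tau / norm) * w[idx]
    return out

def group_lasso(X, y, groups, lam, n_iter=500):
    """Minimise (1/2n)||y - Xw||^2 + lam * sum_g ||w_g||_2 by proximal gradient."""
    n, p = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / n is the gradient's Lipschitz constant.
    step = n / (np.linalg.norm(X, 2) ** 2)
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = group_soft_threshold(w - step * grad, groups, step * lam)
    return w

# Synthetic illustration (hypothetical data): 9 "CNV" features in 3 groups,
# where only the first group drives the "expression" response y.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 9))
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
true_w = np.array([1.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.standard_normal(200)

w = group_lasso(X, y, groups, lam=0.3)
for g, idx in enumerate(groups):
    print(f"group {g}: coefficient norm = {np.linalg.norm(w[idx]):.3f}")
```

Selecting at the group level rather than per feature is what lets the selected subnetworks, and not isolated CNVs, be interpreted biologically. In practice the penalty for each group is often scaled by the square root of the group size, which this sketch omits for brevity.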