| Cancer classification is a major research topic in medicine. Although current medicine can cure about 85% of early-stage cancer patients, the cure rate for advanced cancer is very low, and even patients who are cured typically survive only about 5 years afterwards. Early detection is therefore the most effective means of curing cancer. Informatics-based prediction of cancer classification has great significance and practical value: it can overcome the shortcomings of morphological and imaging-based diagnosis and support early, accurate diagnosis of cancer patients. Current research on cancer classification prediction relies mainly on gene expression profile data. The TCGA database covers many cancer types with large sample sizes, and it provides not only easily accessible, unrestricted gene expression profile data but also data from many other omics technologies. In this thesis, we present a cancer classification prediction model based on gene expression data and DNA methylation data. We hypothesize that combining methylation and gene expression data may change the classification results and reveal important features, because the resulting model reflects differences at the epigenetic level as well as in the transcriptome. The gene expression and DNA methylation data provided by the TCGA database have high dimensionality, small sample size, high noise, and few normal samples. First, we use the SMOTE method to balance the number of normal samples and the number of tumor samples. Then, under ten-fold cross-validation, we select features on each training set with the Minimum Redundancy Maximum Relevance (MRMR) method and build a classification model with the SVM machine-learning algorithm. Finally, the classification results are obtained. In this thesis, we combine gene expression profiling data and DNA methylation data, and explore the 
fusion of data sets, the handling of class imbalance, and the construction of classification models through experiments and comparative studies. We used the TCGA breast cancer gene expression profile and DNA methylation data for the cancer classification studies. Experiment 1 verifies that constructing the classification model directly on the unbalanced dataset leads to degenerate classification results of 100% (over-fitting) or 0% (under-fitting). Experiment 3 shows that only a few features are needed to achieve more than 98% on every assessment indicator. |
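
The balancing step mentioned in the abstract can be illustrated with a minimal sketch of the interpolation idea behind SMOTE. It is not the implementation used in the thesis: the function name `smote`, the parameters `k` and `seed`, and the toy two-feature data are all assumptions made for illustration, and a real study would more likely rely on an established library. Each synthetic sample is placed at a random position on the segment between a minority sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority samples.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (assumed): 3 "normal" samples versus 8 "tumor" samples.
normal = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3]]
tumor = [[1.0 + 0.1 * i, 1.0] for i in range(8)]

# Oversample the normal class until both classes have the same size.
balanced_normal = normal + smote(normal, n_new=len(tumor) - len(normal), k=2)
print(len(balanced_normal), len(tumor))
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class rather than being exact duplicates. To keep the evaluation honest, oversampling of this kind is usually applied only to the training portion of each cross-validation fold.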