| With the completion of the Human Genome Project, gene chip technology has come into wide use across many fields of scientific research, and life science has entered the era of whole-genome analysis and big data. Large volumes of gene expression data obtained with gene chip technology are publicly stored in many specialized databases, which provides the necessary data support for research on gene expression. Meanwhile, cancer is one of the deadliest human diseases, and scientists and medical workers have long been committed to its early diagnosis and treatment; gene chip technology provides a basis for exploring the molecular characteristics of cancer. Today, gastric cancer is the fourth most commonly diagnosed cancer and has the third highest mortality rate in the world. It is therefore valuable to perform feature selection on the gene expression data of gastric cancer patients and to use the resulting characteristic genes to analyze the patients' clinical data. This article downloads the gene expression data and corresponding clinical data of gastric cancer patients from the TCGA database on the GDC platform. After the initial data set is integrated and cleaned, a feature selection method combining filter and wrapper approaches is used to select feature genes. A classification model is then constructed to verify that the selected feature gene set can effectively classify the samples. Survival analysis combining the characteristic gene set with the clinical data of gastric cancer patients also shows that the expression levels of the characteristic genes significantly affect patient survival. The main work of this article can be divided into the following three parts: 1. Selection of gastric cancer characteristic genes. Gene expression data are high-dimensional with small sample sizes, so there is often serious redundancy between features and a large number of noisy genes. First, the Spearman rank
correlation coefficient, signal-to-noise ratio index, and autocorrelation coefficient are used for an initial filtering of genes, which quickly eliminates a large number of redundant and noisy genes. Then three characteristic gene selection models are established to obtain three sets of characteristic genes. The t-test method with an improved p-value retains 163 feature genes that differ significantly between sample categories, the random forest method retains 277 genes with importance greater than 0, and the random-forest-based recursive feature elimination method (RF-RFE) selects 21 characteristic genes as the optimal subset. 2. Establishing a classification model to evaluate the classification performance of the three groups of characteristic genes. To choose the best of the three characteristic gene sets as the final gastric cancer characteristic genes, a nonlinear support vector machine with an RBF kernel is used as the classifier to build a model for identifying stage 1 gastric cancer tumors. By iterating over the number of genes, the effect of the number of genes used from each characteristic gene set on classification performance is investigated. The gene set of the RF-RFE group contains only 21 characteristic genes, yet achieves 95.24% accuracy, 100% specificity, and 0.94 specificity, which is a good classification result for a data set of only 67 samples. The T-test and random forest groups do not perform as well as the RF-RFE group, so the 21 genes of the RF-RFE group are selected as the optimal feature gene set. 3. Survival analysis of gastric cancer patients. To make the research more practically meaningful, the selected characteristic genes are analyzed further. Key genomic, transcriptomic, and proteomic features of these characteristic genes were looked up online, and at least 7 genes were found to be related to the prognosis of the human
digestive system or gastric cancer. Taking 4 genes as examples and carrying out survival analysis with the clinical data, the results show that the expression levels of all four genes affect the survival rate of gastric cancer patients over time, and that the effects of the expression levels of GPX2, GKN2, and ATP11A are highly significant. This article effectively selects genes from gastric cancer patient samples and combines them with clinical data to analyze how characteristic genes affect the survival of gastric cancer patients. The results show that the selected feature gene set can effectively identify patients with stage 1 gastric cancer tumors, and that the expression levels of the feature genes also significantly affect the survival rate of gastric cancer patients over time. The research content and conclusions of this article therefore have some value for the early diagnosis and timely treatment of gastric cancer. |
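
The initial filtering step in part 1 (screening genes by signal-to-noise ratio, then discarding genes that are redundant under the Spearman rank correlation) can be sketched roughly as follows. This is a minimal illustration, not the article's implementation. The signal-to-noise definition used here, |mean difference| / (sum of standard deviations), is an assumed common form; the abstract does not state its formula. The function names, the thresholds `snr_min` and `rho_max`, and the toy expression values are all hypothetical, and the autocorrelation criterion is omitted.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Assumed SNR form: |mean_a - mean_b| / (sd_a + sd_b)."""
    gap = abs(mean(class_a) - mean(class_b))
    spread = stdev(class_a) + stdev(class_b)
    if spread == 0:
        # No within-class variation: perfectly separating or fully constant.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def filter_genes(expression, labels, snr_min=0.5, rho_max=0.9):
    """Keep genes with SNR >= snr_min, then drop any gene whose absolute
    Spearman correlation with an already kept, higher-SNR gene exceeds rho_max."""
    scored = []
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == 1]
        b = [v for v, lab in zip(values, labels) if lab == 0]
        snr = signal_to_noise(a, b)
        if snr >= snr_min:
            scored.append((snr, gene))
    scored.sort(reverse=True)  # strongest class signal first
    kept = []
    for _, gene in scored:
        if all(abs(spearman(expression[gene], expression[g])) <= rho_max
               for g in kept):
            kept.append(gene)
    return kept


# Toy data (invented): 1 = stage 1 sample, 0 = other stage.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_a": [5.1, 5.3, 4.9, 1.0, 1.2, 0.8],  # separates the classes
    "gene_b": [6.0, 7.0, 5.0, 1.5, 2.0, 1.0],  # same ranking as gene_a
    "gene_c": [3.0, 1.0, 2.0, 2.5, 1.5, 2.0],  # no class signal
}
selected = filter_genes(expression, labels)
print(selected)
```

On this toy input, `gene_c` is removed by the signal-to-noise screen and `gene_b` is removed as redundant with `gene_a`. In the workflow the abstract describes, the surviving genes would then be passed to the t-test, random forest, and RF-RFE selection models.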