Since the onset of modern society, the prevalence of cancer has increased year by year, and cancer has become one of the greatest threats to human life and health. Among cancers, lung cancer is the most common. Its incidence and mortality differ between male and female patients. Among men, lung cancer has the highest incidence and mortality of any cancer. Among women, breast cancer is the leading cause of cancer death, followed by lung cancer.

Advances in computer technology have brought considerable progress to the medical field. Diagnostic models trained with machine learning on patients' gene sequencing data have greatly improved diagnostic efficiency. Given the differences between men and women in lung cancer, gender information should be an important part of training and optimizing diagnostic models, but most existing studies do not make full use of it. This article presents a comparative study of gender-specific and gender-independent models. It uses three feature selection algorithms and five classification algorithms (Logistic Regression, Gaussian Naive Bayes, Random Forest, Support Vector Machine, and AdaBoost) to evaluate the contribution of gender information to the early detection of non-small cell lung cancer.

Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are the two main subtypes of non-small cell lung cancer, accounting for about 80% of all lung cancers. This article first downloads the transcriptome data and corresponding clinical data of LUAD and LUSC from the TCGA (The Cancer Genome Atlas) database. The two kinds of data are merged, and only samples with transcriptome data, cancer stage, gender, follow-up time, time of death, and survival status are retained. Samples at stage Ⅰ or Ⅱ are treated as early-stage samples, and samples at stage Ⅲ or Ⅳ as late-stage (advanced) samples. The final data set consists of LUAD (393 early, 110 late) and LUSC (406 early, 91 late) samples.

Data preprocessing and statistical analysis were performed on the two cancer data sets. Neither the Chi-square test nor the Spearman correlation coefficient (SCC) showed a significant correlation between gender and cancer stage. Survival analysis showed that patients diagnosed at an early stage have a better survival rate. The t-test feature selection algorithm was used to select and rank the features, and the five classification algorithms were evaluated. As the number of selected features increases, the classification performance of Gaussian Naive Bayes gradually declines, while the Support Vector Machine maintains relatively good classification performance. The Support Vector Machine is therefore used as the evaluation model in the rest of this article.

Subsequent experiments showed that the gender-specific models outperform the gender-independent model for early lung cancer. With the filter-based t-test feature selection algorithm, the gender-independent model for LUAD needs 93 features to reach a diagnostic accuracy of 0.8012, whereas the female-specific model reaches an accuracy of 0.8529 with only 75 features and the male-specific model reaches an accuracy of 0.8788 with 64 features. The Venn diagram shows that men and women share only a few transcriptome biomarkers for early lung cancer. These conclusions were also verified with the other feature selection algorithms, on the LUSC data set, and on an independent gastric cancer data set. The experimental data in this article indicate that gender information should be used when optimizing cancer diagnosis models.

Finally, a gender-based biomarker recommendation program for the early diagnosis of cancer is packaged. Users only need to input the training data set and label set in the format required by the program. For the specified feature selection algorithm, the program then outputs a classification performance file for the five classifiers on different feature subsets, a biomarker recommendation file, and a biomarker Venn diagram, all of which are convenient for researchers to use.
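
The t-test feature ranking and the comparison of gender-specific biomarker sets described above can be illustrated with a minimal sketch. This is not the program packaged in this article. It assumes Welch's form of the t-statistic (the text says only "t-test"), uses synthetic toy data in place of TCGA transcriptome data, and omits the classification step. The names `welch_t`, `rank_features`, and `make_toy` are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (an assumed t-test variant)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 0.0 if se == 0 else (mean(a) - mean(b)) / se


def rank_features(X, y, top_k):
    """Return indices of the top_k features ranked by |t| between
    early-stage (label 0) and late-stage (label 1) samples."""
    early = [row for row, label in zip(X, y) if label == 0]
    late = [row for row, label in zip(X, y) if label == 1]
    scores = []
    for j in range(len(X[0])):
        t = welch_t([r[j] for r in early], [r[j] for r in late])
        scores.append((abs(t), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_k]]


def make_toy(n_per_class, informative, n_features=10, shift=5.0):
    """Synthetic expression matrix: only the features listed in
    `informative` differ between early and late samples."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [random.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in informative:
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


random.seed(0)
# Toy assumption: feature 2 is informative in both groups,
# feature 0 only in women, feature 1 only in men.
X_female, y_female = make_toy(20, informative=[0, 2])
X_male, y_male = make_toy(20, informative=[1, 2])

female_top = rank_features(X_female, y_female, top_k=2)
male_top = rank_features(X_male, y_male, top_k=2)

# Gender-independent ranking on the pooled samples.
pooled_top = rank_features(X_female + X_male, y_female + y_male, top_k=3)

# The overlap corresponds to the intersection region of a Venn diagram.
shared = set(female_top) & set(male_top)
print("female-specific:", sorted(female_top))
print("male-specific:", sorted(male_top))
print("pooled:", sorted(pooled_top))
print("shared:", sorted(shared))
```

In a full pipeline, each ranked feature subset would then be passed to a classifier such as a Support Vector Machine and scored with cross-validation; that step is left out here to keep the sketch free of third-party dependencies.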