| Identifying biomarkers associated with the occurrence and development of disease in genomic data substantially advances diagnosis, treatment research, and prevention. Because of limits on collection cost and sequencing capacity, genomic data are characterized by small sample sizes and ultra-high dimensionality. Traditional feature selection methods applied to genomic data usually focus on the classification performance of the selected features and neglect the biological interpretability of the results. Three problems are addressed here. (1) When screening cancer diagnostic biomarkers in genomic data, traditional machine learning methods readily find genes with good predictive performance, but because these methods use no biological prior knowledge, functional enrichment analysis of the selected genes yields little of value for further research. (2) When screening cancer prognostic biomarkers in genomic data, popular methods neither account for tumor heterogeneity nor make full use of the actual interactions between genes. As a result, the selected genes lack specificity for different cancer types and show no biological significance in survival analysis; that is, the results are of no clinical use. (3) When screening biomarkers related to type 2 diabetes in human gut microbiota metagenomic data, the collinearity between features caused by the ultra-high dimensionality, together with incomplete reference knowledge, makes the biological interpretability of the feature selection results doubtful. To address these three problems, the following studies were carried out in this paper. To address the problem that traditional feature selection methods cannot obtain valuable biological findings when screening tumor diagnostic biomarkers, an innovative method called network-constraint infinite latent feature
selection (NCILFS) was proposed by combining a functional interaction network with machine learning methods. By introducing biological prior knowledge, this method improves the biological interpretability of feature selection. The method was applied to screen diagnostic biomarkers for five cancer types, and the experimental results showed that NCILFS achieved the highest predictive performance on the expression data for four cancer types. The number of oncogenes selected by NCILFS for the five cancer types showed the highest significance compared with other methods, and Gene Ontology (GO) analysis and Gene Set Enrichment Analysis (GSEA) showed that NCILFS found the most biologically significant gene sets. To address the problem that existing network-based methods do not make full use of the regulatory direction and weight of the gene network and do not adequately consider tumor heterogeneity across samples, a clustering-based weighted network feature selection method was proposed in this paper for screening tumor prognostic biomarkers. This method combines two measurements for graph weighting and reduces tumor heterogeneity through a clustering process. The experimental results for five cancer types showed that the clustering-based method had better prediction performance in most cases, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis and Kaplan-Meier (KM) survival analysis showed that the clustering-based weighted network method obtained more biologically interpretable gene sets. To address the collinearity in metagenomic data caused by high dimensionality, along with the limited reference knowledge that makes the data difficult to analyze, a feature selection method based on the collinearity probability distribution of the data was applied in this study. To find the most discriminative candidate meta-biomarkers in metagenomic data, Iterative Sure Independence Screening (ISIS) was used to
screen the biomarkers. This method reduces the influence of feature collinearity through an iterative filtering technique and an approximately unbiased regularization process. It was employed to screen biomarkers related to type 2 diabetes in metagenomic data from Chinese and European populations. In total, 48 representative meta-biomarkers were selected from the Chinese data and 24 from the European data. The experimental results showed that the highest prediction performance of the selected biomarkers, measured by the area under the curve (AUC), was 0.97 on the Chinese data and 0.99 on the European data. Biological annotation of the feature selection results showed a significant relation between the selected meta-biomarkers and type 2 diabetes. Moreover, the results showed that the human gut microbiota differs between European and Chinese populations. |
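
The iterative screening idea behind ISIS, mentioned in the abstract above, can be illustrated with a minimal sketch. This is not the implementation used in this work: the function names `marginal_scores` and `iterative_screening`, the parameters `per_round` and `rounds`, and the synthetic data are all hypothetical. The sketch also makes two simplifying assumptions: the response is continuous rather than a case/control label, and an ordinary least-squares refit stands in for the approximately unbiased regularization step.

```python
import numpy as np

def marginal_scores(X, r):
    """Absolute Pearson correlation between each column of X and the vector r."""
    Xc = X - X.mean(axis=0)
    rc = r - r.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(rc)
    denom = np.where(denom == 0, np.inf, denom)  # zero-variance inputs score 0
    return np.abs(Xc.T @ rc) / denom

def iterative_screening(X, y, per_round=2, rounds=3):
    """Hypothetical helper: repeatedly screen features against the current residual."""
    selected = []
    residual = y.astype(float)
    for _ in range(rounds):
        scores = marginal_scores(X, residual)
        if selected:
            scores[selected] = -np.inf  # never pick the same feature twice
        new = np.argsort(scores)[::-1][:per_round]
        selected.extend(int(j) for j in new)
        # Assumption: plain least squares replaces the penalized fit.
        A = np.column_stack([np.ones(len(y)), X[:, selected]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef  # later rounds screen what is still unexplained
    return selected

# Synthetic demonstration: only features 0 and 5 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 150))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=100)
selected = iterative_screening(X, y)
print(selected)
```

Screening against the residual rather than the raw response is what lets later rounds recover features whose marginal signal is masked by correlated features that were selected earlier.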