| Objective: To use bioinformatics methods and tuberculosis (TB) sample data from a gene expression database to screen for differentially expressed genes that distinguish TB patients, patients with latent TB infection, and healthy individuals, and to construct and validate a model that classifies TB infection status, thereby providing a theoretical basis and a practical tool for the early and definitive diagnosis of TB. Methods: The GSE19491, GSE37250, and GSE42834 datasets were downloaded, annotated, and preprocessed, then merged using an empirical Bayes approach to remove batch effects. Differentially expressed genes among TB patients, patients with latent TB infection, and healthy individuals were subjected to GO function and KEGG pathway enrichment analysis and to protein interaction network analysis. The differentially expressed genes common to the three groups were further reduced by Lasso feature selection to obtain the core differential genes. Models based on the core differential genes were trained and validated using ten-fold cross-validation repeated 100 times. Model accuracy and the per-group precision, recall, and F1 values were calculated from a 3×3 confusion matrix to evaluate model performance comprehensively. Results: (1) After data pooling and processing, the combined dataset contained 487 samples and 19144 genes. With \|log2FC\| > 1.5 and FDR < 0.05 as thresholds, 787 differentially expressed genes were found between TB patients and healthy individuals, 355 between patients with latent TB infection and healthy individuals, and 129 between TB patients and patients with latent TB infection. A total of 1296 differentially expressed genes were found in patients with latent TB infection. The differentially expressed genes were mainly enriched in biological pathways such as innate immune response, immune response, secretory regulation of immune system processes, defense response, and cell activation, and the protein interaction networks were significantly enriched (P < 1×10⁻¹⁶). (2) There were 34 core differential genes among TB patients, patients with latent TB infection, and healthy individuals, which were reduced to 18 after Lasso feature selection. Of these, HRK and PNMA3 were up-regulated in TB patients and patients with latent TB infection, while the remaining genes were down-regulated in both groups. The prediction model constructed by linear discriminant analysis had the highest accuracy and F1 value. Ten core differential genes, including EPSTI1, SLC26A8, and GBP6, were potential biomarkers for distinguishing TB patients, patients with latent TB infection, and healthy individuals. Conclusion: Bioinformatics methods combined with data mining techniques can help determine the disease progression of pulmonary tuberculosis at the molecular and genetic levels. The approach is feasible, has practical application value, and offers methods and ideas for the accurate diagnosis of pulmonary tuberculosis. |
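
The evaluation step described in the Methods (accuracy plus per-group precision, recall, and F1 from a 3×3 confusion matrix) can be illustrated with a minimal sketch. The counts below are hypothetical and are not the study's results. The function name `per_class_metrics`, the group labels, and the convention that rows are true groups and columns are predicted groups are all assumptions made for illustration.

```python
def per_class_metrics(cm, labels):
    """Return overall accuracy and per-class precision, recall, and F1.

    Assumes cm[i][j] counts samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    accuracy = correct / total if total else 0.0

    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum: all predicted as this class
        actual = sum(cm[i])                          # row sum: all truly in this class
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, metrics


# Hypothetical counts for illustration only (not data from the study).
labels = ["TB", "LTBI", "Healthy"]
cm = [
    [40, 5, 5],   # true TB
    [4, 42, 4],   # true LTBI
    [6, 3, 41],   # true Healthy
]

accuracy, metrics = per_class_metrics(cm, labels)
print(f"accuracy = {accuracy:.2f}")
for label in labels:
    m = metrics[label]
    print(f"{label}: precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} F1={m['f1']:.2f}")
```

In a repeated cross-validation setting such as the one described, these quantities would typically be computed per fold or on the pooled predictions and then summarized. Which of the two the authors used is not stated in the abstract.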