Differential Gene Screening And Bioinformatics Analysis Of Pulmonary Sarcoidosis Based On Random Forest Algorithm

Posted on: 2022-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: M C Wu
Full Text: PDF
GTID: 2480306560499434
Subject: Public Health
Abstract/Summary:
Objective: This study uses pulmonary sarcoidosis gene expression profile data collected in the GEO database, combines the SAM algorithm with the random forest algorithm to screen differentially expressed genes, and applies bioinformatics methods to carry out the related analyses of the pulmonary sarcoidosis gene expression profile. The aim is to study the genes related to pulmonary sarcoidosis and identify its key regulatory genes, to provide a new perspective for the study of the etiology and pathogenesis of sarcoidosis, and to promote the health of the population.
Methods: The bioinformatics analysis of the pulmonary sarcoidosis gene expression profile data in this study can be roughly divided into three parts: data preprocessing, screening of differentially expressed genes, and gene function enrichment analysis together with PPI analysis. First, the raw pulmonary sarcoidosis data were located in and downloaded from the online GEO database, and the robust multi-array average (RMA) algorithm was used to normalize and otherwise preprocess the raw data to obtain the gene expression matrix. Then, the SAM algorithm was applied to the processed matrix for a preliminary screening of differentially expressed genes. Subsequently, a random forest classification model was built on the expression matrix of the genes retained by the SAM screening, each gene was given an importance score by the random forest algorithm, and the final set of differentially expressed genes was selected. In addition, the stability of this screening method was compared with that of a common differential gene screening method, the moderated t-statistic method. After screening, the DAVID and KOBAS databases were used to perform GO and KEGG enrichment analysis of the differentially expressed genes in order to study their functions and the pathways in which they are enriched. Finally, the STRING database was used to construct a protein-protein interaction (PPI) network, Cytoscape software was used to visualize the PPI network, and the cytoHubba plugin was used to find key genes related to pulmonary sarcoidosis.
Results: Preliminary screening with the SAM algorithm yielded 9268 differentially expressed genes. A random forest classification model was then established, the differentially expressed genes were ranked by importance, and 466 important differentially expressed genes were selected. GO functional enrichment analysis and KEGG pathway enrichment analysis were performed on the genes selected by the random forest algorithm. GO enrichment analysis showed that the differentially expressed genes of pulmonary sarcoidosis are mainly involved in biological processes such as blood glucose homeostasis, regulation of RNA splicing, redox reaction, bronchial cartilage development, and tissue homeostasis; in terms of cellular component, they are mainly enriched in the mitochondrial membrane space; in terms of molecular function, they are mainly involved in functions such as hydrolase activity. KEGG pathway enrichment analysis showed that the differentially expressed genes are mainly associated with pathways including pyrimidine metabolism, ABC transporters, the chemokine signaling pathway, the Jak-STAT signaling pathway, the cGMP-PKG signaling pathway, EGFR tyrosine kinase inhibitor resistance, Yersinia infection, human cytomegalovirus infection, and RNA transport. Using the STRING database, a protein-protein interaction network of the 466 differentially expressed genes was constructed, containing 376 nodes and 390 edges. After visualization in Cytoscape, the top 20 genes ranked by each of four algorithms in the cytoHubba plugin were overlapped, giving a total of 6 core genes: AKT1, STAT3, ALYREF, PA2G4, CTGF and IL13.
Conclusion: Compared with the common robust t-test screening method, the combination of significance analysis of microarrays (SAM) and the random forest algorithm has better accuracy in screening differentially expressed genes, and a total of 466 differentially expressed genes were obtained. GO and KEGG enrichment analysis of the selected differentially expressed genes gave results that were basically consistent with existing studies. A PPI network was constructed and hub genes were screened for the differentially expressed genes of pulmonary sarcoidosis; the key genes regulating the occurrence of pulmonary sarcoidosis were AKT1, STAT3, ALYREF, PA2G4, CTGF and IL13. The involvement of STAT3 and IL13 has been confirmed in related studies.
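
The hub-gene step described in the Results (overlapping the top-ranked genes from several network rankings) can be illustrated with a minimal Python sketch. This is not the thesis's code, and it makes several assumptions: the abstract does not name the four cytoHubba algorithms, so the ranking names below are placeholders; node degree is used only as one simple example of a ranking; and the gene labels G1 to G8 and the edge list are invented toy data, not results from the study.

```python
from collections import Counter


def degree_ranking(edges):
    """Rank nodes of an undirected network by degree, highest first."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Sort by descending degree, then by name so ties are deterministic.
    return sorted(degree, key=lambda node: (-degree[node], node))


def overlap_top_k(rankings, k):
    """Return the genes that appear in the top k of every ranking."""
    top_sets = [set(ranked[:k]) for ranked in rankings.values()]
    return set.intersection(*top_sets) if top_sets else set()


# Toy PPI edge list (hypothetical gene labels, not data from the thesis).
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3"), ("G5", "G1")]

# Four rankings of the same network; only the first is actually computed,
# the other three stand in for whatever scoring methods are chosen.
rankings = {
    "degree": degree_ranking(edges),
    "method_b": ["G2", "G1", "G5", "G3", "G6"],
    "method_c": ["G3", "G1", "G2", "G7", "G4"],
    "method_d": ["G1", "G3", "G2", "G8", "G5"],
}

hub_genes = overlap_top_k(rankings, k=3)
print(sorted(hub_genes))
```

With real data, each ranking would be exported from the network analysis tool and `k` set to 20 to match the procedure in the abstract. Taking the intersection rather than the union keeps only genes that every ranking agrees on, which is why the final list can be much shorter than 20.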
Keywords/Search Tags: Gene expression profile, Differentially expressed genes, SAM algorithm, Random forest, Bioinformatics