Screening, Identification Of Biomarkers And Diagnostic Model Construction For Pancreatic Cancer Based On Bioinformatics

Posted on: 2021-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: T D Li
GTID: 2404330602972622
Subject: Public Health
Abstract/Summary:
Pancreatic cancer has become a serious threat to human health, with a 5-year survival rate of only 4-7%. Although the diagnosis and treatment of pancreatic cancer have advanced in recent years, its early symptoms are easily missed, so most patients already have distant metastases at initial diagnosis and have lost the best opportunity for early surgical treatment. With the development of next-generation sequencing technology, bioinformatics, multi-omics and machine learning methods have become an important direction for precision medicine research. The exploration, identification and validation of new biomarkers are therefore of great significance for the diagnosis and treatment of pancreatic cancer.

Objectives
1. To screen biomarkers that may play an important role in the development and progression of pancreatic cancer, based on bioinformatics and experimental verification in cells and tissues.
2. To explore the intrinsic correlation between key genes and pancreatic cancer by combining multi-omics, protein interaction network and immune microenvironment analyses, providing new ideas for the study of its occurrence and development.
3. To construct a diagnostic model using machine learning, providing a theoretical basis for the diagnosis and individualized treatment of pancreatic cancer.

Methods
1. Searching, downloading and pre-processing of datasets: Datasets related to pancreatic cancer were systematically searched and selected in the GEO, ArrayExpress, TCGA, GTEx and ICGC databases. Gene chip data were downloaded as raw CEL files, and transcriptome high-throughput sequencing (RNA-Seq) data were downloaded as expression count matrices. The downloaded data underwent pre-processing, including quality control, background correction, normalization and gene annotation. Log-scale robust multi-array average (RMA) was used to normalize the gene chips, TPM (transcripts per million) was used to standardize the RNA-Seq data, and the ComBat function was used to correct batch effects between datasets.
2. Screening potential biomarkers for pancreatic cancer based on bioinformatics:
1) Weighted gene co-expression network analysis (WGCNA): WGCNA was used to construct a scale-free network for pancreatic cancer and to find biomarkers closely related to its occurrence and development. This was achieved mainly by screening soft thresholds, delineating gene modules, and analyzing the correlation between gene modules and clinical phenotypes. GO and KEGG enrichment analyses were performed on the genes in the related modules to clarify the cell signaling pathways involving these genes and the possible roles of these pathways in pancreatic cancer.
2) Identification of the biomarkers of pancreatic cancer: Cytoscape 3.7.2 was used to construct a gene-gene interaction network for the genes in the key module, and the hub genes of pancreatic cancer were determined according to degree. Differential expression analysis of the hub genes was carried out in 8 independent gene chip datasets, and the results were verified in RNA-Seq data. To further verify their role in the development of pancreatic cancer, Cox regression was used to analyze the impact of the key genes on patient prognosis. Finally, CCLE data were used to analyze the expression of the key genes in 30 types of cancer cells, including pancreatic, lung and prostate cancer.
3. Identification of key genes by experiments:
1) Quantitative real-time polymerase chain reaction (qRT-PCR): Primers were designed with Primer 6.0, and qRT-PCR was used to detect the expression of TSPAN1, TMPRSS4, SDR16C5 and CTSE in the PANC-1, SW1990 and AsPC-1 pancreatic cancer cell lines. Relative expression abundance was determined by ΔCt = Ct(target gene) - Ct(GAPDH).
2) Immunohistochemistry (IHC): IHC was used to detect the protein expression levels of the key genes in seventy pairs of pancreatic cancer tissues and adjacent tissues, and the staining results were graded independently by two pathologists. The semi-quantitative scoring formula was H-score = Σ p_i (i + 1). Protein expression in pancreatic cancer tissues and adjacent tissues was then compared with a paired t test.
4. Exploration of the association between key genes and pancreatic cancer:
1) Gene mutation, copy number variation and DNA methylation analysis: TCGA data were used to analyze the mutation patterns and copy number variation of TSPAN1, TMPRSS4, SDR16C5 and CTSE, and Pearson correlation analysis was performed between copy number variation and gene expression. After quality control of the DNA methylation data, differential CpG analysis was carried out.
2) Protein-protein interaction (PPI): The STRING website was used to perform PPI network analysis, and GO and KEGG enrichment analyses were performed for the proteins interacting with TSPAN1, TMPRSS4, SDR16C5 and CTSE.
3) Correlation analysis of TSPAN1 with KRAS, CDKN2A (p16), TP53 and SMAD4: RNA-Seq data were used to test the Pearson correlation between TSPAN1 and the genes most frequently mutated in pancreatic cancer (KRAS, CDKN2A (p16), TP53 and SMAD4), in order to analyze the possible regulatory role of TSPAN1 in more depth.
4) Analysis of the association between hub genes and the immune microenvironment: The deconvolution algorithm in the CIBERSORT function was used to quantify the abundance of 22 immune cell types, and differential analysis was performed between pancreatic cancer tissues and adjacent tissues. The correlation between TSPAN1, TMPRSS4, SDR16C5 and CTSE and the differential immune cells was tested by Pearson correlation.
5. Construction and validation of the diagnostic model: First, the diagnostic performance of the four key genes was evaluated with a logistic regression model. Diagnostic models of pancreatic cancer were then constructed by combining machine learning with 10-fold cross-validation. The validation set and common digestive tract cancers (gastric, esophageal, liver and colorectal cancer) were used separately to evaluate diagnostic and differential diagnostic performance.
6. Statistical analysis methods: All statistical analyses were completed in R 3.5.3. The main packages used were affy, oligo, TCGAbiolinks, WGCNA, limma, DESeq2, edgeR, maftools, ChAMP, caret and ggplot2, among others. In all analyses, P<0.05 was considered statistically significant.

Results
1. GSE28735, GSE15471, GSE16515, GSE32688, GSE71989, GSE106189, GSE62452, GSE62165 and GSE32676 were obtained from the GEO database; E-MEXP-2780 and E-MTAB-6134 were obtained from ArrayExpress; and RNA-Seq data of pancreatic cancer, normal pancreas and 30 types of cancer cells were obtained from TCGA, ICGC, GTEx and CCLE.
2. Screening potential biomarkers for pancreatic cancer based on bioinformatics:
1) Eighteen gene modules were identified using WGCNA, of which the yellow-green module was the most relevant to pancreatic cancer (R2=0.85, P=6.5e-49). The genes in this module were involved in oxidoreductase activity, the cyclooxygenase P450 pathway, the glycosphingolipid biosynthesis (lacto and neolacto series) pathway and the mucin-type O-glycan biosynthesis pathway, among others.
2) Twenty hub genes were identified by gene-gene interaction analysis. Differential analysis of eight gene chip datasets identified TSPAN1, TMPRSS4, SDR16C5 and CTSE. RNA-Seq data verified that the four genes were highly expressed in pancreatic cancer tissues, and the difference was statistically significant (P<0.05). Cox regression analysis showed that, among the five common digestive tract cancers, TSPAN1, TMPRSS4 and SDR16C5 were related to survival only in pancreatic cancer. RNA-Seq data at the cellular level also showed that TSPAN1, TMPRSS4, SDR16C5 and CTSE were highly expressed in pancreatic cancer cells.
3. Identification of key genes by experiments:
1) qRT-PCR showed that TSPAN1, TMPRSS4, SDR16C5 and CTSE had moderate to high expression abundance in the three pancreatic cancer cell lines. TSPAN1 and CTSE showed high expression abundance, with ΔCt values below 12.
2) IHC showed that the expression levels of TSPAN1, TMPRSS4, SDR16C5 and CTSE in pancreatic cancer tissues were higher than those in adjacent tissues, and the differences were statistically significant (P<0.05). The expression levels in pancreatic cancer tissues and adjacent tissues were as follows: 7.27 ± 0.31 and 6.88 ± 0.14; 7.16 ± 0.24 and 7.02 ± 0.13; 7.15 ± 0.24 and 6.99 ± 0.14; 7.00 ± 0.26 and 6.76 ± 0.09.
4. Exploration of the association between key genes and pancreatic cancer:
1) Gene mutation analysis showed that TMPRSS4, SDR16C5 and CTSE had mutations in pancreatic cancer (all missense mutations). Copy number variation analysis showed that the copy number variation of TMPRSS4 and CTSE was related to mRNA expression level (P<0.05). DNA methylation analysis found that TSPAN1, TMPRSS4, SDR16C5 and CTSE were all hypomethylated in pancreatic cancer.
2) In the PPI analysis, the proteins interacting with TSPAN1 were involved in the cell cycle, p53, cancer, pancreatic cancer and other important signaling pathways, and TSPAN1 was also involved in the regulation of KRAS, SMAD4 and TP53.
3) The correlations between TSPAN1 and KRAS, CDKN2A (p16), TP53 and SMAD4 were statistically significant, with correlation coefficients of 0.67 (P<0.001), 0.36 (P<0.001), 0.48 (P<0.001) and 0.15 (P<0.05), respectively.
4) Immune microenvironment analysis found that plasma cells, CD8 T cells, monocytes, M0 macrophages, M1 macrophages, M2 macrophages and activated dendritic cells differed between pancreatic cancer tissues and adjacent tissues (P<0.05). Moreover, the four key genes were statistically correlated with M0 and M1 macrophages (P<0.05).
5. The AUC of each of the four key genes was greater than 0.872. The eight diagnostic models based on TSPAN1, TMPRSS4, SDR16C5 and CTSE all showed high accuracy, above 90%. Among them, the accuracy of the random forest, neural network and flexible discriminant analysis algorithms in the validation set reached 100%. However, the accuracy in gastric cancer, esophageal cancer, liver cancer and colorectal cancer was less than 0.60.

Conclusion
1. This study identified four key genes closely related to pancreatic cancer based on bioinformatics analysis: TSPAN1, TMPRSS4, SDR16C5 and CTSE. All of them were experimentally verified in cells and tissues.
2. Through multi-omics, protein interaction and immune microenvironment analyses, the study found a possible internal relationship between the key genes and pancreatic cancer, as well as a possibly important role of TSPAN1 in the development of pancreatic cancer and its clinical potential. This provides a new perspective on the occurrence and development of pancreatic cancer and a theoretical basis for basic research on the disease.
3. Pancreatic cancer diagnostic models with an accuracy of more than 90% were constructed based on machine learning. The accuracy of the random forest, neural network and flexible discriminant analysis models in internal validation was as high as 100%, which provides a theoretical basis for the early diagnosis of pancreatic cancer.
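
The three formulas named in the Methods (TPM standardization, ΔCt and the H-score) can be illustrated with a minimal Python sketch. This is not the thesis code, which was written in R. The function names, the gene lengths in kilobases and all numeric inputs are hypothetical, and reading p_i as the percentage of cells stained at intensity grade i (grades 0 to 3) is an assumption, since the abstract does not define the symbols.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to transcripts per million (TPM).

    counts and lengths_kb are parallel lists; gene lengths are an
    assumed extra input, not something the abstract provides.
    """
    # Reads per kilobase: correct each count for gene length.
    rpk = [c / length for c, length in zip(counts, lengths_kb)]
    # Rescale so that the values of one sample sum to one million.
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]


def delta_ct(ct_target, ct_reference):
    """Delta Ct = Ct(target gene) - Ct(reference gene, GAPDH in the abstract)."""
    return ct_target - ct_reference


def h_score(percent_by_grade):
    """H-score = sum of p_i * (i + 1), as written in the abstract.

    percent_by_grade maps an intensity grade i to p_i; treating p_i as a
    percentage of stained cells is an assumption made for illustration.
    """
    return sum(p * (i + 1) for i, p in percent_by_grade.items())


# Hypothetical inputs, for illustration only.
tpm_values = tpm(counts=[100, 300, 600], lengths_kb=[1.0, 3.0, 2.0])
dct = delta_ct(ct_target=22.5, ct_reference=18.0)
score = h_score({0: 10, 1: 20, 2: 30, 3: 40})
```

Because ΔCt is a difference of cycle thresholds, a smaller value means higher expression relative to GAPDH, which is why the abstract reports ΔCt values below 12 as high expression abundance.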
Keywords/Search Tags: pancreatic cancer, bioinformatics, machine learning, weighted gene co-expression network analysis (WGCNA), multi-omics