
Systematic Discovery Of Biomarkers And Construction Of Predictive Models For Non-small Cell Lung Cancer By Using Bioinformatic Techniques

Posted on: 2019-07-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J X Shi
GTID: 1314330545462426
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Background and objective
The incidence and mortality of lung cancer rank first among malignant tumors worldwide, and around 85% of lung cancers are non-small cell lung cancer (NSCLC). Because effective approaches to early diagnosis are lacking, many patients miss the optimal treatment window and are already at an advanced stage when first diagnosed. The continuous improvement of high-throughput sequencing technologies and analytical methods has provided novel concepts and approaches for the study of biomarkers in lung cancer. The development of cancer is a complex biological process involving multiple genes, interacting factors, and multiple stages. It involves mutations in proto-oncogenes, changes in transcript expression profiles, and abnormalities of protein structure, function, or expression level. Studying the molecular mechanisms of lung cancer with high-throughput sequencing technology will lay a theoretical foundation for the early diagnosis and targeted therapy of lung cancer.

Materials and Methods
A systematic search was conducted in two commonly used public databases (GEO and ArrayExpress). Three high-throughput transcriptome sequencing (RNA-seq) datasets and two TCGA RNA-seq datasets (LUAD and LUSC) related to lung cancer were included. We rebuilt the RNA-seq analysis pipeline according to current mainstream recommendations and re-analyzed the original data for two of the three GEO datasets with a standardized process to obtain transcriptome gene counts. For LUAD and LUSC, transcriptome count data were downloaded directly through the API of the GDC website because the original sequencing files were not accessible. The five datasets were then merged into one large data matrix. Differential expression analysis was performed with DESeq2 and edgeR in R, the limma package was used to identify and remove batch effects, and the vst function from DESeq2 was used for normalization to obtain normalized gene expression matrices. Weighted gene co-expression network analysis (WGCNA) was applied to the transcriptome expression profile of 1327 NSCLC tissues and 231 normal para-cancerous tissues (controls) to construct a topology network. We constructed gene modules and searched for modules closely related to NSCLC, and gene ontology (GO) and KEGG pathway enrichment analyses were performed to explore the functions of the genes in key modules. By combining the differential expression results with the WGCNA results, we obtained a set of differentially expressed genes within the module closely related to NSCLC. We then extracted the expression data of these genes from the normalized transcriptome expression data. Finally, ten-fold cross-validation combined with machine learning algorithms was used to construct prediction models for NSCLC.

Results
Differential expression analysis was conducted with the DESeq2 and edgeR packages, and differentially expressed genes were defined by |logFC| > 1 and P < 0.01. After comparing the results from DESeq2 and edgeR, a total of 2956 genes were up-regulated in NSCLC, including 2124 protein-coding genes, 254 lncRNAs, and 578 genes of other types. A total of 1790 genes were down-regulated, consisting of 1565 protein-coding genes, 96 lncRNAs, and 129 genes of other types. WGCNA identified 39 gene modules, 2 of which were strongly correlated with NSCLC (turquoise module: R^2 = 0.60; blue module: R^2 = -0.79, as reported; both P < 0.001). The turquoise module is closely related to NSCLC. GO analysis of the turquoise module showed that its genes are located in cellular components such as the nuclear chromosome, chromosome, centrosome, microtubule organizing center, cytoskeleton, microtubule, and microtubule cytoskeleton; that they take part in molecular functions such as DNA binding, transcription regulation, and ATP binding; and that they are involved in biological processes such as proliferation, cytoskeleton and microtubule organization, the mitotic cell cycle, nuclear division, sister chromatid segregation, DNA metabolism, DNA replication, DNA repair, and the cellular response to DNA damage stimuli. KEGG pathway analysis showed that the turquoise module genes are mainly enriched in pathways such as cell cycle, meiosis, and cellular senescence. The differentially expressed genes in the module are mainly involved in cell cycle, oocyte meiosis, progesterone-mediated oocyte maturation, cellular senescence, the p53 signaling pathway, homologous recombination, and other pathways. These findings shed new light on the molecular mechanisms of NSCLC development. Combining the WGCNA results with the differentially expressed genes showed that the turquoise module contains 988 differentially expressed genes closely related to NSCLC. Twelve NSCLC prediction models, built from the expression matrix of these 988 genes in 1558 subjects using 10-fold cross-validation combined with machine learning algorithms, showed good prediction performance in the validation dataset. Among them, the models from the SVM, XGBoost, C5.0, PLS, and AdaBoost algorithms were more accurate than those from the other algorithms. Although the transparent or semi-transparent models constructed by the JRip, PART, and Rpart algorithms had acceptable accuracy in the validation dataset, their specificity was lower. Taking all of this into account, two black-box models, SVM and XGBoost, were selected as the final models. This study successfully constructed NSCLC prediction models with accuracies higher than 0.98.

Conclusions
In this study, differential expression analysis and WGCNA of NSCLC-related RNA-seq data available in public databases were used to screen for DEGs and for genes closely related to NSCLC. The GO and KEGG results further revealed the underlying mechanisms of NSCLC. Normalized gene expression data were fed to several machine learning methods, and 10-fold cross-validation was used to construct NSCLC predictive models, several of which reached an accuracy higher than 0.98 in the validation group. This study lays a foundation for applying RNA-seq data to the early genetic diagnosis of NSCLC.
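
The gene-selection step described in the abstract (thresholding at |logFC| > 1 and P < 0.01, comparing the DESeq2 and edgeR results, then keeping the differentially expressed genes that fall in the NSCLC-related module) can be illustrated with a minimal Python sketch. This is not the author's pipeline, which was run in R. The function names, the toy values, and the rule of keeping only genes that both tools call significant in the same direction are assumptions made for illustration; the abstract does not state how the two result sets were compared.

```python
from typing import Dict, Set, Tuple

# Hypothetical per-gene results: gene -> (log fold change, P value).
Result = Dict[str, Tuple[float, float]]


def significant(results: Result, lfc_cut: float = 1.0,
                p_cut: float = 0.01) -> Dict[str, str]:
    """Return gene -> 'up'/'down' for genes with |logFC| > lfc_cut and P < p_cut."""
    calls = {}
    for gene, (lfc, p) in results.items():
        if abs(lfc) > lfc_cut and p < p_cut:
            calls[gene] = "up" if lfc > 0 else "down"
    return calls


def consensus_degs(a: Result, b: Result) -> Dict[str, str]:
    """Keep genes that both result sets call significant in the same direction.

    Assumed comparison rule; the abstract does not specify one.
    """
    calls_a, calls_b = significant(a), significant(b)
    return {g: d for g, d in calls_a.items() if calls_b.get(g) == d}


def module_degs(degs: Dict[str, str], module_genes: Set[str]) -> Set[str]:
    """Intersect the consensus DEGs with the genes of one co-expression module."""
    return set(degs) & module_genes


# Toy values for illustration only, not data from the study.
deseq2 = {"G1": (2.3, 0.0001), "G2": (-1.8, 0.002),
          "G3": (0.4, 0.0005), "G4": (1.5, 0.2)}
edger = {"G1": (2.1, 0.0003), "G2": (-1.6, 0.004),
         "G3": (1.2, 0.001), "G4": (1.4, 0.005)}
turquoise = {"G1", "G3", "G4"}  # hypothetical module membership

degs = consensus_degs(deseq2, edger)
module_hits = module_degs(degs, turquoise)
print(degs)
print(sorted(module_hits))
```

In this toy example G3 and G4 each pass the thresholds in only one of the two result sets, so they are dropped before the module intersection. Under the assumed rule, a step of this kind would produce the set of module genes whose normalized expression values are passed on to the predictive models.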
Keywords/Search Tags: non-small cell lung cancer, transcriptome, high-throughput sequencing, early diagnosis, predictive model, database