
Identification Of Feature Genes Between Adenocarcinoma And Squamous Cell Carcinoma Of Lung Cancer And Classification Of NSCLC Using SAM-GSR Algorithms

Posted on: 2019-06-23 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: L L Wang | Full Text: PDF
GTID: 1364330548456704 | Subject: Biophysics
Abstract/Summary:
Lung cancer is the malignancy with the highest morbidity and mortality in China, and non-small-cell lung carcinoma (NSCLC) accounts for more than 85% of cases. Squamous cell carcinoma (SCC) and adenocarcinoma (AC) are the two most important pathological types of NSCLC. Their pathogenesis and growth differ markedly, so they call for different clinical therapies. However, because the molecular mechanisms of SCC and AC remain insufficiently studied, the two subtypes are usually given the same treatment, which leads to poor outcomes. This study therefore aimed to identify feature genes associated with the SCC and AC subtypes using bioinformatics analysis, and to examine the different pathogenic mechanisms of the two subtypes through gene function analysis and prediction of upstream regulatory factors. In addition, the reported feature selection algorithms all have shortcomings, in particular low accuracy in single-gene selection. We therefore also evaluated the feasibility of the SAM-GSR algorithm for selecting genes associated with the SCC and AC subtypes and for classifying the stages of each subtype.

First, four qualified lung cancer expression profiles were obtained from the NCBI GEO database. The datasets were quality-controlled with the MetaQC package, and differentially expressed genes (DEGs) were identified with the MetaDE package. The rank and cor.test functions were used to test the correlation and consistency of significant differences between the datasets. Second, Gene Ontology (GO) function and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of the DEGs were performed with the DAVID online analysis tool. Third, human protein-protein interactions from three comprehensive databases (STRING, BioGRID and HPRD) were integrated to construct a protein-protein interaction (PPI) network using Cytoscape 3.3. The topology of the PPI network was analyzed based on node degree, closeness centrality (CC) and betweenness centrality (BC). Nodes were sorted in descending order of each parameter, and the top 100 nodes for each parameter were selected. The genes shared by the three top-100 lists were subjected to pathway and transcription factor (TF) enrichment analysis. Fourth, the miR2Disease database was searched for miRNAs directly associated with AC and SCC. The miRNA target genes were collected and mapped to the important genes screened previously. The disease miRNA and TF regulatory network was constructed by combining these with the TFs associated with the important genes obtained in the previous step. Fifth, with GSE43580 as the training dataset and the other three expression datasets as validation datasets, the most important genes were screened by random forest (RF), and a classification model was built with a support vector machine (SVM). Finally, the GSE50081 dataset and RNA-seq data from the TCGA database were analyzed using the SAM-GSR algorithm. We first used the GSE50081 data as the training set and the RNA-seq data as the test set, then swapped them and applied SAM-GSR again, to assess the performance of SAM-GSR on NSCLC data for stage segmentation and feature selection.

The results are as follows:
(1) Based on the MetaQC package, a total of 1201 consistent DEGs were obtained, including 661 down-regulated and 540 up-regulated DEGs.
(2) GO function and KEGG pathway enrichment analysis showed that the DEGs were significantly enriched in GO functions such as cell adhesion, calcium binding, biological adhesion and epidermal cell differentiation. In addition, most DEGs were significantly enriched in several KEGG pathways, including cell adhesion molecules, the complement and coagulation cascade, the glycolysis pathway and melanogenesis.
(3) In the integrated interaction network, 869 protein pairs mapping to 529 genes were obtained by integrating the gene relationships from the three databases; these comprised 135 down-regulated and 394 up-regulated genes. Thirty-nine important genes were obtained by intersecting the top-100 genes by degree, BC and CC. Among them, the up-regulated genes ELAVL1 and MYC ranked first and second, respectively, in degree, BC and CC. A total of 7 significantly related KEGG pathways and 8 significantly related TFs were identified for the 39 genes. These genes were closely associated with the cell cycle, p53 and TGF-β signaling pathways, and eight upstream TFs (e.g., NFY, EGR1 and NKX2-2) of DEGs were predicted.
(4) A total of 5 miRNAs (hsa-miR-200b, hsa-miR-205, hsa-miR-18a, hsa-miR-486 and hsa-let-7a) were significantly associated with lung cancer. The miRNA-target-TF network contained 32 nodes, including 4 miRNAs, 8 TFs, 3 significantly down-regulated genes and 17 up-regulated genes. Among the 4 miRNAs, miR-200b had the most target genes (e.g., ERRFI1, PPARGC1A and MAPK6), and EZH2, MAPK6, MYC, SUV39H1 and TK1 were targeted by let-7a.
(5) The optimal gene combination extracted with the RF algorithm consisted of 5 genes: Synuclein Alpha (SNCA), Interferon Gamma Inducible Protein 16 (IFI16), MAPK6, ERRFI1 and Stratifin (SFN). The GSE43580 dataset, which contained the most samples, was used as the training dataset to construct a disease subtype classifier from the 5-gene combination screened in the previous step. The classifier accurately classified 133 samples (75 AC and 58 SCC), an accuracy of 88.67%. Validation on the three remaining datasets showed that the classification model was reproducible and portable.
(6) The performance of SAM-GSR was comparable with that of the Lasso, penalized SVM, DEGs + SVM and Radviz + SVM feature selection algorithms, with a BCM value of 0.609 and an AUPR of 0.63, which ranked first and second, respectively.

Conclusions:
1. A series of keratin family members (e.g., KRT1, KRT4, KRT5, KRT6B, KRT15, KRT16, KRTAP19-1, KRT23, KRT75 and KRT33A) were differentially expressed between SCC and AC samples, with higher expression in SCC samples than in AC samples.
2. CDK1, CCND1, SFN and CHEK2 may contribute to the different proliferation rates of SCC and AC cells by regulating the cell cycle and p53 signaling pathways. In addition, two other DEGs, SMAD7 and MYC, were involved in the TGF-β signaling pathway, indicating a difference in TGF-β signaling between the SCC and AC subtypes.
3. Overexpressed miR-200b might play an important role in the development of lung cancer by down-regulating its targets ERRFI1 and PPARGC1A, and might be more closely associated with the development of SCC. In addition, down-regulated let-7a could contribute to NSCLC tumorigenesis by regulating EZH2, SUV39H1, TK1 and MYC; the expression of these downstream targets was higher in SCC samples than in AC samples, suggesting that these molecular mechanisms may also be more involved in the development of SCC.
4. Five genes (SNCA, IFI16, MAPK6, ERRFI1 and SFN) were selected by the machine learning method. The classifier built on the expression of these 5 genes in AC and SCC samples can accurately distinguish the lung cancer subtypes. This is important for the accurate treatment and prevention of the AC and SCC subtypes.
5. SAM-GSR can perform feature selection for SCC and AC, and it was comparable with other algorithms for stage segmentation of each subtype. Because SAM-GSR is limited in the completeness of its pathway information and in model parsimony, and does not take pathway topology knowledge into account, the algorithm will be modified accordingly and its feature selection re-evaluated in future work, which may advance the development of pathway-based feature selection algorithms.
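
The topology screening step in the abstract above (rank PPI nodes by degree, closeness centrality and betweenness centrality, keep the top 100 for each, then intersect the three lists) can be illustrated with a minimal Python sketch. This is not the dissertation's code, which used Cytoscape 3.3; the function names, the closeness formula (reachable nodes divided by summed shortest-path distances) and the alphabetical tie-breaking rule are all assumptions made for illustration.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    graph = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree(graph):
    """Number of interaction partners per node."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def closeness(graph):
    """Assumed definition: reachable nodes / sum of shortest-path distances."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


def betweenness(graph):
    """Unnormalised betweenness centrality via Brandes' algorithm."""
    scores = dict.fromkeys(graph, 0.0)
    for source in graph:
        order = []                       # nodes in non-decreasing distance
        preds = {v: [] for v in graph}   # shortest-path predecessors
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):        # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # In an undirected graph every node pair is visited from both ends.
    return {v: s / 2 for v, s in scores.items()}


def top_n(scores, n):
    """Top-n nodes by descending score; ties broken alphabetically (assumption)."""
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return set(ranked[:n])


def hub_genes(edges, n=100):
    """Genes that fall in the top n for degree, closeness and betweenness."""
    graph = build_graph(edges)
    return (top_n(degree(graph), n)
            & top_n(closeness(graph), n)
            & top_n(betweenness(graph), n))
```

With an edge list exported from the integrated network, `hub_genes(edges, n=100)` would mirror the top-100 cut-off described in the abstract. Cytoscape may define or normalise these centralities differently, so rankings from this sketch need not match the reported 39 genes.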
Keywords/Search Tags: Non-small cell lung cancer, meta-analysis, differentially expressed genes, functional and pathway enrichment, random forest algorithms