Characteristic Gene Screening And Prediction Of Gastric Cancer

Posted on:2024-03-12

Degree:Master

Type:Thesis

Country:China

Candidate:R X Ma

Full Text:PDF

GTID:2530307058980809

Subject:Applied Statistics

Abstract/Summary:

Gastric cancer is a highly heterogeneous tumor that originates in the gastric mucosal epithelium. Because of its high incidence, its pathogenesis and prognosis have attracted considerable research interest in recent years. As understanding of its molecular mechanisms has advanced, early diagnosis and individualized treatment have also come into focus. In this thesis, key genes for gastric cancer are screened from gastric cancer sample data and then used for diagnostic prediction. The specific tasks are as follows.

First, 426 clinical records and 59,427 genes related to the genetic diagnosis of gastric cancer from the TCGA database were preprocessed to obtain the gene ID matrix.

Second, for key gene screening, differential expression analysis based on fold change was used for preliminary screening, and then random forest, LASSO and rf-LASSO models were built separately. To reduce the randomness of the screening results and ensure their biological significance, 30 key genes were selected through model training and parameter optimization. Stability was evaluated with the Hamming distance, and KEGG enrichment analysis was used to compare the number of significant pathways across the selected gene groups and to explore the biological significance of each group. The Hamming distance was 0.622 and there were 9 significantly enriched pathways. The 30 key genes selected by random forest were taken as the final screening result: RPLP0P2, INHBA, ESM1, GABRD, CST1, BMP8A, CKMT2, GPM6B, APOC1, CMTM5, VEGFD, ADAMTS12, CLEC3B, MMP11, FOXS1, SDS, ADAM12, AC026369.2, IL11, TREM2, LINC02086, DMBX1, WNT2, OTX1, CTHRC1, HOXC12, LINC01050, DUXAP10, EPHB2 and CXCL9.

Finally, to further verify the effectiveness of the selected key genes, diagnostic prediction was carried out. Four representative algorithms were selected: naive Bayes classification, LDA (linear discriminant analysis), XGBoost and a single-hidden-layer neural network. Normal samples were labeled 0 and cancer samples 1. The original data set was divided into training and test sets with a cross-validation partitioning scheme, the confusion matrix was computed and the ROC curve was drawn. According to the evaluation metrics, all four models predicted well, which further supports the quality of the screened key genes. XGBoost, with an overall accuracy (ACC) of 0.953, an overall recall (REC) of 0.941, a REC1 of 0.991, an F1 score of 0.842 and an AUC of 0.876, was selected as the main prediction model of this thesis. The contribution of each gene to the prediction was then ranked. ESM1 contributed as much as 70%, and a GeneCards lookup and GO enrichment analysis of this gene provide a basis for subsequent research. Judging from the screening and prediction models, ensemble learning algorithms show clear advantages in biostatistics and are a promising direction for further research.
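
The abstract evaluates the stability of gene screening with the Hamming distance but does not state the exact formula. The following is a minimal Python sketch of one common way to do this, under stated assumptions: each screening run is encoded as a 0/1 mask over a shared gene list, and stability is the mean pairwise Hamming distance divided by the number of genes. The function names, the normalization and the toy runs are illustrative assumptions, not the thesis's procedure or results.

```python
from itertools import combinations


def selection_mask(selected, all_genes):
    """Encode one screening run as a 0/1 vector over all_genes."""
    chosen = set(selected)
    return [1 if gene in chosen else 0 for gene in all_genes]


def hamming_distance(a, b):
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_hamming(selections, all_genes):
    """Mean Hamming distance over all pairs of runs, scaled to [0, 1].

    Assumption: distances are divided by len(all_genes); other
    normalizations are possible. Lower values mean the runs agree more.
    """
    masks = [selection_mask(s, all_genes) for s in selections]
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two selections to compare")
    total = sum(hamming_distance(a, b) for a, b in pairs)
    return total / (len(pairs) * len(all_genes))


# Toy example only: gene names come from the abstract, but these three
# "runs" are made up for illustration and are not results from the thesis.
all_genes = ["INHBA", "ESM1", "CST1", "MMP11", "WNT2", "CXCL9"]
runs = [
    ["INHBA", "ESM1", "CST1"],
    ["INHBA", "ESM1", "MMP11"],
    ["INHBA", "WNT2", "CXCL9"],
]
score = mean_pairwise_hamming(runs, all_genes)
print(round(score, 4))
```

In practice the runs would typically come from repeating a screening method (for example random forest) on resampled data, so that the score reflects how much the selected gene set changes between repetitions.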

Keywords/Search Tags:

random forest, XGBoost algorithm, Hamming distance, KEGG enrichment analysis

Related items

1	Microarray Data Mining And Bioinformatics Analysis Based On Tuberculosis Gene Chip Data
2	An Alignment Algorithm For DNA Short Reads Based On The Hamming Distance
3	Hamming Distance And Lee Distance Of Linear Negacyclic And Cyclic Codes Over Z_2α
4	Qualitative Logging Identification Of Water-flooded Zones In F Block Based On Random Forest
5	Research On The Application Of Simulated Annealing And Integration Algorithm In Risk Control Area
6	Research On Zoning Of Landslide Susceptibility Based On XGBoost
7	Differential Gene Screening And Bioinformatics Analysis Of Pulmonary Sarcoidosis Based On Random Forest Algorithm
8	Screening And Bioinformatics Analysis Of MicroRNA Microarray For Lung Adenocarcinoma
9	Construction Of Coding Caching Schemes Based On Hamming Distance
10	Multi-Factor Quantitative Trading Strategy Design Based On Random Forest Algorithm