Risk Prediction Model Of Nasopharyngeal Carcinoma Based On Machine Learning

Posted on: 2024-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: R F Lu
GTID: 2544307166453884
Subject: Public health
Abstract/Summary:
Objective: Nasopharyngeal carcinoma (NPC) is a common malignancy in southern China. Current screening for NPC relies primarily on EBV serology, endoscopy and imaging. Each of these methods has its place, but all are economically burdensome for large-scale population screening. In this study, machine learning algorithms were used to train NPC risk prediction models on real patient electronic medical record data, with the aim of identifying people at high risk of NPC and providing a basis for prevention and treatment efforts.

Methods: Data from 2805 patients attending the Affiliated Hospital of Guilin Medical College from January 2018 to June 2021 were analyzed retrospectively, including 1357 in the NPC group and 1448 in the non-NPC group. All features recorded during patient treatment were collected, and feature selection was performed using the XGBoost algorithm and a patient graph, respectively. The data were divided into training, validation and test sets in a ratio of 6:2:2. Risk prediction models were trained with the XGBoost algorithm on the different feature sets and compared with the Random Forest (RF), Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) algorithms. Model performance was evaluated by recall, precision, accuracy, specificity, area under the receiver operating characteristic curve (AUC), and reliability curve. Finally, the 95% confidence interval of each evaluation index was calculated using the Bootstrap method.

Results: Among the 1357 NPC patients, 386 (28.45%) were female and 971 (71.55%) were male; 489 (36.04%) were 30-50 years old, 785 (57.85%) were 51-70 years old, and 83 (6.12%) were older than 70 years. Risk prediction models with different numbers of features were trained separately using the XGBoost algorithm. The model containing 100 features, covering comorbid diseases, symptoms, clinical observations, laboratory tests, medical history, operations, and other risk factors, performed best, with a recall of 0.939, precision of 0.921, AUC of 0.934 and accuracy of 0.934. Analysis of the patient graph yielded 51 features highly correlated with NPC. After screening and integration, a feature set containing 5 diseases, 1 history item, 3 clinical observations and 11 symptoms was obtained. An XGBoost model trained on this integrated feature set still performed well, with a recall of 0.797, precision of 0.945, AUC of 0.878 and accuracy of 0.884. RF, SVM and KNN models were each trained on the optimal feature set and on the integrated feature set and compared with XGBoost. On the optimal feature set, the RF algorithm had the highest recall (0.969), the XGBoost algorithm had the highest accuracy (0.921), and the KNN algorithm performed worst. On the integrated feature set, all models performed well. The SVM algorithm had the highest recall and AUC, at 0.816 and 0.833. The XGBoost algorithm had the highest precision and accuracy, at 0.945 and 0.884. For identifying NPC patients, the most important disease was sinus inflammation, followed by middle ear disease. Among symptoms, the most important were head symptoms, followed by throat symptoms, neck mass, eye symptoms, nasal symptoms and ear symptoms.

Conclusions: In this study, the XGBoost algorithm and a patient graph were used for feature selection on electronic medical record data, and an optimal feature set and an integrated feature set were obtained after screening. XGBoost was used to construct risk prediction models on feature sets of different sizes. Of the 163 features, the optimal selection of 100 features, covering conditions, symptoms, observations, laboratory tests, medical history, procedures and other risk factors, gave the best prediction of NPC risk. This model can be used to identify people at high risk in hospital settings. As a novel element of this study, a Neo4j database was used to construct the patient graph. Analysis of the graph screened out high-risk features such as sinus inflammation, middle ear disease, head symptoms, throat symptoms, neck masses, eye symptoms, nose symptoms and ear symptoms. These variables can be obtained easily through questionnaires and other methods, so the model can be applied to early screening for NPC in the general population.
Keywords/Search Tags:Nasopharyngeal carcinoma, XGBoost, Neo4j, Feature selection, Disease risk prediction
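
The evaluation steps named in the Methods (a 6:2:2 split, confusion-matrix metrics, and Bootstrap 95% confidence intervals) can be illustrated with a minimal sketch. It assumes binary labels (1 = NPC, 0 = non-NPC), an unstratified random split, and a percentile bootstrap with 1000 resamples. The abstract does not state which bootstrap variant, resample count or splitting procedure was used, so these choices and all function names below are illustrative assumptions, not the thesis's implementation.

```python
import random


def split_622(items, seed=0):
    """Shuffle and split into train/validation/test in a 6:2:2 ratio.

    Assumption: a plain random split; the abstract does not say
    whether the split was stratified by group.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 10   # integer arithmetic avoids float rounding
    n_val = n * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def classification_metrics(y_true, y_pred):
    """Recall, precision, specificity and accuracy from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for one metric.

    Assumption: the percentile method; resamples where the metric is
    undefined (NaN) are skipped.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        value = classification_metrics(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )[metric]
        if value == value:  # NaN is the only value not equal to itself
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper
```

Calling `split_622(range(2805))` on the cohort size reported above gives subsets of 1683, 561 and 561 records. Given test-set labels and predictions, `bootstrap_ci(y_true, y_pred, "recall")` returns the lower and upper bounds of a 95% interval under the stated assumptions.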