| Objective: Mining the residual value of genetic data with machine learning is likely to become an active research topic. This study aimed to construct prediction models that combine machine learning algorithms with genetic data on ankylosing spondylitis (AS), and to identify, through classification and screening of the genetic data, a model that can accurately predict the incidence of AS, so as to provide a methodological reference for predicting AS from genetic data. Methods: A total of 734 samples were collected, of which 294 were from AS patients in the Department of Rheumatology and Immunology of the First Affiliated Hospital of Anhui Medical University, and 440 were from healthy individuals at the Hefei Central Blood Bank and the Physical Examination Center of the First Affiliated Hospital of Anhui Medical University. The study identified 21 single nucleotide polymorphisms (SNPs): rs1406846, rs1799864, rs0030906, rs2302685, rs28362459, rs334558, rs3751143, rs4951523, rs6540679, rs7958311, rs10208769, rs10865331, rs14170, rs1729674, rs3811616, rs55785307, rs11616018, rs4958846, rs4964879, rs7300908, and rs9652059. These loci come from a variety of genes that may be involved in the inheritance of bone diseases, such as Runx2. All samples were genotyped using SNPscan typing technology. The SNPs were then coded numerically (homozygotes assigned 0 or 2 and heterozygotes assigned 1), gender was set as a Boolean, and age as a short integer. Five machine learning algorithms (decision tree [DT], random forest [RF], support vector machine [SVM], k-nearest neighbor [KNN], and logistic regression) were used to construct AS prediction models from the genetic data. Model performance was evaluated using accuracy (ACC), the receiver operating characteristic (ROC) curve, the area under the curve (AUC), cross-validation, and other indicators and methods. An unsupervised learning algorithm was also used to comprehensively evaluate the application characteristics of each model. Python 3.8
and R 4.1.2 were used for data analysis, and the two-sided significance level was 0.05. Results: For model settings, the parameter tree_number of the RF model was set to 800, at which point the error score of the RF model tended to be stable. For the SVM model, four kernel functions were considered: Gaussian, polynomial, linear, and sigmoid. For the KNN model, the parameter n_neighbors was set to 3 to ensure the robustness of the prediction model while keeping the computational cost manageable. For SNP importance screening, the DT and RF models showed that rs1406846, rs334558, rs7958311, rs10865331, and rs4958846 played a significant role in the models and may be of particular significance. For model fit, the RF model had the largest AUC (0.5997), while the SVM model with the sigmoid kernel had the smallest (0.5274). The AUC values of the DT, KNN, and logistic regression models were 0.5554, 0.5911, and 0.5692, respectively. In internal validation, the ACC values of all models on the validation set were broadly similar. To avoid chance errors in training, ten-fold cross-validation was also used to evaluate the models. In the mean ACC over ten-fold cross-validation, the DT model reached only 54.64%. The performance of the SVM model with the Gaussian kernel in cross-validation was basically consistent with that in internal validation, remaining stable at about 60%. The established models were then used to predict the full sample set. The SVM and logistic regression models predicted poorly, with ACC values much lower than those of the other three models, whereas the DT, RF, and KNN models all had true positive rate (TPR) and true negative rate (TNR) values close to 90% on the full sample set. In the comprehensive evaluation of the models, in
terms of the AUC of the ROC curve, the RF prediction model had the highest value and a good fit, while the sigmoid-kernel SVM prediction model had the lowest. In terms of ACC on the validation set, there was no significant difference among the classifier models, and accuracy was close to 60%. In terms of cross-validation accuracy, the Gaussian-kernel SVM prediction model was the highest at 59.68%. For prediction on the full sample set, the RF prediction model performed best, followed by the DT and KNN prediction models, while the other models did not perform well in this back-prediction. Sensitivity and specificity were evaluated with the Youden index, and the KNN prediction model had the highest Youden index, 0.8297. Conclusion: The models in this study performed differently under their respective evaluation indicators, so practitioners should choose a prediction and classification model according to the actual needs of their task. In addition, the genetic data used in this study cannot fully capture the genetic predictors of AS (weak generalization ability). Many other data types are available for AS, such as imaging information and DNA methylation, so the next step will be to integrate these data and mine the hidden information from different aspects to further improve the prediction efficiency of the models. |
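
The SNP coding rule and the Youden index mentioned in the abstract can be illustrated with a minimal sketch. This is not the study's code: the function names `encode_genotype` and `youden_index` are hypothetical, and the choice of which homozygote receives 0 (here, the one matching a supplied reference allele) is an assumption, since the abstract only states that homozygotes are coded 0 or 2 and heterozygotes 1. The Youden index is computed as sensitivity (TPR) plus specificity (TNR) minus 1.

```python
# Illustrative sketch only. Function names and the reference-allele
# convention are assumptions, not taken from the study.

def encode_genotype(genotype, ref_allele):
    """Code a biallelic genotype such as "AG" as 0, 1, or 2.

    The code is the number of alleles that differ from ref_allele, so the
    reference homozygote is 0, the heterozygote 1, the other homozygote 2.
    """
    if len(genotype) != 2:
        raise ValueError("expected a two-allele genotype such as 'AG'")
    return sum(1 for allele in genotype if allele != ref_allele)


def youden_index(y_true, y_pred):
    """Return Youden's J = TPR + TNR - 1 for binary labels (1 = AS, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    tpr = tp / (tp + fn)  # sensitivity
    tnr = tn / (tn + fp)  # specificity
    return tpr + tnr - 1.0


if __name__ == "__main__":
    # Toy genotypes for one SNP, with "A" assumed to be the reference allele.
    print([encode_genotype(g, "A") for g in ["AA", "AG", "GG"]])
    # Toy labels: one of two cases is missed, both controls are correct.
    print(youden_index([1, 1, 0, 0], [1, 0, 0, 0]))
```

Under this convention, TPR and TNR values both near 90% correspond to a Youden index of roughly 0.8, which is in line with the value reported for the KNN model on the full sample set.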