Background and purpose: Prostate cancer is a serious threat to men's health. Current screening methods include digital rectal examination, prostate-specific antigen (PSA) testing, and imaging, but they are limited by insufficient accuracy and high cost. Prostate biopsy, the gold standard for diagnosis, is an invasive procedure that carries a risk of infection and other complications, so clinicians need a reliable way to judge whether it is indicated. Machine learning, a branch of artificial intelligence, plays an important role in many prediction tasks and has the potential to predict prostate biopsy results. However, as algorithms have advanced, their complexity has increased, giving rise to the "black-box effect": the decision-making process is often opaque and the results lack explanation, which increases distrust of the model. For patients with suspected prostate cancer, a model's judgment may have significant consequences, and this limits the further application of machine learning. The interpretability of machine learning models is therefore particularly important. In this study, we built machine learning models to predict the outcome of prostate biopsy from blood test results and other indicators, and performed interpretability analysis of the model to obtain trustworthy predictions.

Materials and methods: Data were collected from patients who underwent prostate biopsy at our hospital from 2015 to 2021, all of whom had routine venous blood tests and other examinations. Age, blood pressure status, and blood test results, including routine blood tests, liver function tests, biochemical indices, serum lipids, and PSA, were extracted from the patients' medical records and tabulated. Missing values were imputed by model fitting, and features were selected using the mutual information method. Models were then trained under a supervised learning scheme with ensemble algorithms (random forest, XGBoost) and non-ensemble algorithms (support vector machine, logistic regression), combined with ten-fold cross-validation. Predictive performance was evaluated using the confusion matrix and the area under the curve (AUC). Finally, SHAP analysis was used to interpret the model, clarifying the importance of each feature and the basis of the model's decisions.

Results: Four prediction models were constructed, all of which identified patients with prostate cancer well. Compared with the ensemble algorithms, the non-ensemble algorithms had higher false positive rates; the support vector machine was the worst in this respect and had the lowest overall accuracy, at 73%. Among the ensemble models, XGBoost performed best, with an accuracy of 80% and an AUC of 0.84. SHAP analysis visually explained the decision weight and prediction direction of each feature in the XGBoost model, indicating that PSA and age remained the main decision factors in the model's predictions, and that uric acid, triglyceride, and apolipoprotein A were also important reference features.

Conclusion: Machine learning can predict the result of prostate biopsy using data from physical and blood examinations. Among the algorithms compared, the XGBoost ensemble model achieved the highest prediction accuracy and can be used to pre-assess the indication for prostate biopsy. The model's decisions can be explained visually through SHAP analysis, which shows the importance and role of age, PSA, albumin-globulin ratio, CRP, uric acid, triglyceride, apolipoprotein A, and other indicators in the prediction, providing a reference for clinicians. Model construction and prediction depend on the data set, and data from a single center may limit the model's generalizability.
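
The evaluation quantities reported above (accuracy, false positive rate, and AUC) can be illustrated with a minimal, standard-library-only sketch. This is not the study's code: the labels, scores, and the 0.5 decision threshold below are hypothetical toy values chosen for illustration, and the function names are our own. AUC is computed here as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels, where 1 = cancer on biopsy."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of true negatives that were flagged as positive."""
    return fp / (fp + tn)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) AUC: P(score of positive > score of negative)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true biopsy outcomes and model-predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, fp, tn, fn))
print("false positive rate:", false_positive_rate(fp, tn))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy and false positive rate depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why a model comparison such as the one above typically reports both.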