
Analysis Of The Effect Of Various Machine Learning Algorithms In Predicting Hospitalization Cost Of Respiratory Diseases

Posted on: 2021-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: C Sun
GTID: 2404330626459012
Subject: Public health
Abstract/Summary:
Objective: Using hospitalization expense data for respiratory diseases, models were built with three methods (random forest, support vector machine, and neural network) to compare the predictive performance of these mainstream machine learning methods at different sample sizes, to assess the feasibility of applying machine learning to hospitalization expense prediction, and to provide methodological support for such prediction.

Methods: Hospitalization data for respiratory diseases in Shenzhen in 2013 were divided into mutually exclusive training and test sets. Hyperparameter optimization and model training for the random forest, support vector machine, and neural network models were carried out on the training set. Each trained model was then used to predict the dependent variable on the test set, and the predictions were compared with the true values. The area under the ROC curve, confusion matrix, accuracy, precision, recall, and F1 score were used to evaluate and compare the three models. This procedure was repeated on subsets of different sample sizes and on the complete data set to show how the performance of the three methods changes with sample size.

Results: With a sample size of 500, the areas under the ROC curve for the random forest, support vector machine, and neural network models were 0.911, 0.875 and 0.796 respectively; the accuracies were 64.80%, 63.20% and 54.40%; and the F1 scores were 63.52, 62.36 and 53.29. Random forest and support vector machine performed better than the neural network, but the random forest model took the longest to train and predict. With a sample size of 2,000, the areas under the ROC curve for the three models were 0.944, 0.915 and 0.923 respectively; the accuracies were 76.40%, 71.80% and 74.80%; and the F1 scores were 75.79, 71.33 and 74.22. At this size the differences in predictive performance among the three methods were much smaller, the neural network's metrics improved markedly, and the random forest model still took the longest. With a sample size of 10,000, the areas under the ROC curve were 0.945, 0.934 and 0.934 respectively; the accuracies were 76.60%, 74.44% and 74.44%; and the F1 scores were 76.78, 74.23 and 74.39. The gap among the three methods narrowed further, and the training time of random forest and support vector machine increased noticeably. With a sample size of 124,980, the areas under the ROC curve were 0.942, 0.939 and 0.953 respectively; the accuracies were 76.80%, 74.88% and 77.51%; and the F1 scores were 77.10, 74.81 and 77.73. The neural network model was the best in training time, prediction time, and predictive performance. The predictive performance of the support vector machine and random forest models remained good, but their training times were 4.4 times and 44.8 times that of the neural network model respectively. Overall, random forest performed well and was very stable across sample sizes. Support vector machine performed well on small samples; its predictions remained reliable on large samples, but the computation was heavy and took too long. The neural network performed much worse than the other two methods when the sample size was insufficient, but its predictive performance improved rapidly as the sample size grew, and it consistently took less time.

Conclusion:
1. Random forest, support vector machine, and neural network are all feasible for predicting hospitalization costs for respiratory diseases.
2. As the sample size increases, the predictive ability of random forest is stable and excellent, and its computation time is acceptable; the predictive ability of support vector machine is stable but lower than that of random forest; the predictive ability of the neural network improves markedly, and its computation time is the shortest.
3. Hyperparameter optimization can significantly improve the predictive performance of the support vector machine and neural network models, but it is of limited help to the random forest model.
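
The evaluation metrics named in the Methods (confusion matrix, accuracy, precision, recall, F1 score) can be illustrated with a minimal sketch. The abstract does not state how many cost categories were used or how per-class scores were averaged, so the three cost bands (`low`, `mid`, `high`), the toy labels, and the macro-averaging below are assumptions for illustration only, not the thesis's actual setup.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def macro_scores(matrix):
    """Accuracy plus macro-averaged precision, recall and F1."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical cost bands and toy predictions (not data from the thesis).
labels = ["low", "mid", "high"]
y_true = ["low", "low", "mid", "mid", "high", "high"]
y_pred = ["low", "mid", "mid", "mid", "high", "low"]

matrix = confusion_matrix(y_true, y_pred, labels)
scores = macro_scores(matrix)
print(matrix)
print(scores)
```

Macro-averaging weights every class equally regardless of its size; a weighted average would instead favor the more frequent cost bands, which may matter if the categories are imbalanced.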
Keywords/Search Tags:Machine learning, Support vector machine, Neural network, Random forest, Hospitalization cost