| Coronary heart disease (CHD) is a common cardiovascular disease that causes a large number of deaths every year. The incidence of CHD is rising year by year, and the disease is increasingly affecting younger people. If patients with CHD are identified at an early stage and receive regular, effective treatment, the fatality rate can be reduced. Research on CHD is therefore an important subject for national public health. Although coronary angiography is an effective means of diagnosing CHD, it is difficult to popularize in most areas of China because it is expensive, invasive, and technically demanding to perform. Identifying individuals at high risk of CHD early and non-invasively is therefore of great significance for social and economic development. Machine learning algorithms, which have matured in recent years, offer a new possibility for the early diagnosis of CHD. This paper first reviews in detail the work of domestic and foreign scholars who have applied machine learning algorithms to the early diagnosis of CHD and achieved valuable results. It then introduces three single models (logistic regression, k-nearest neighbor, and Naive Bayes) and one ensemble model (AdaBoost), and analyzes the principles, advantages, and disadvantages of each algorithm. Finally, the evaluation metrics for classification are introduced. Descriptive statistical analysis and data preprocessing were performed on the CHD datasets. Recursive feature elimination, the two-sample K-S test, and random forest were then each used for feature selection; features selected by two or more of these methods were kept as the final features, and the important features were visualized and analyzed. Finally, three single models 
and one ensemble model were used for modeling, and the classification performance of the four models was compared using the evaluation metrics. The ensemble model AdaBoost achieved the best classification performance, with a precision of 91.99%, a recall of 85.59%, an accuracy of 82.24%, an F1 value of 0.8867, and an AUC of 0.9117. Because neither the single models nor the ensemble model can learn features, they cannot capture feature-interaction information. To further improve prediction performance, Boosting+LR, a combined model based on the stacking idea, is applied to CHD prediction. First, a Boosting model learns crosses of the CHD risk-factor features to produce combined features; the combined features are then encoded as new discrete feature vectors and fed to the logistic regression model for prediction. GBDT+LR outperformed both XGBoost+LR and AdaBoost, with a precision of 93.58%, a recall of 86.65%, an accuracy of 84.62%, an F1 value of 0.8998, and an AUC of 0.9247. |
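
The Boosting+LR idea described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the paper's actual implementation: the helper names (`leaf_indices`, `fit_gbdt_lr`, `predict_proba_gbdt_lr`), the hyperparameters, the synthetic data, and the choice to fit the trees and the logistic regression on disjoint halves of the training set are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder


def leaf_indices(gbdt, X):
    # apply() gives the leaf each sample lands in for every tree; for a
    # binary target the result has shape (n_samples, n_trees, 1).
    return gbdt.apply(X)[:, :, 0]


def fit_gbdt_lr(X_tree, y_tree, X_lr, y_lr, n_trees=30, seed=0):
    """Hypothetical helper: GBDT builds combined features, LR predicts on them."""
    gbdt = GradientBoostingClassifier(
        n_estimators=n_trees, max_depth=3, random_state=seed
    ).fit(X_tree, y_tree)
    # Each leaf corresponds to one root-to-leaf path, i.e. one combination of
    # feature conditions. One-hot encoding the leaf indices turns them into a
    # discrete feature vector.
    encoder = OneHotEncoder(handle_unknown="ignore").fit(leaf_indices(gbdt, X_tree))
    lr = LogisticRegression(max_iter=1000).fit(
        encoder.transform(leaf_indices(gbdt, X_lr)), y_lr
    )
    return gbdt, encoder, lr


def predict_proba_gbdt_lr(model, X):
    # Probability of the positive class from the LR stage.
    gbdt, encoder, lr = model
    return lr.predict_proba(encoder.transform(leaf_indices(gbdt, X)))[:, 1]


# Synthetic stand-in data (an assumption; NOT the CHD dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
# Assumption: fit the trees and the LR on disjoint halves to limit overfitting.
X_tree, X_lr, y_tree, y_lr = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0
)

model = fit_gbdt_lr(X_tree, y_tree, X_lr, y_lr)
proba = predict_proba_gbdt_lr(model, X_test)
```

Each sample ends up with exactly one active indicator per tree, so the logistic regression effectively learns a weight for every learned feature combination. Swapping the first stage for another boosting model would follow the same pattern, provided that model exposes per-tree leaf indices.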