In recent years, great changes have taken place in the national lifestyle. At the same time, population aging is accelerating, which has led to an increasing incidence of cardiovascular disease. Therefore, if the key characteristics that predict the mortality risk of patients with cardiovascular disease can be found through data mining and data modeling, and an efficient prediction model can be established, this will provide a basis for medical diagnosis and the selection of subsequent treatment methods. The data came from the UCI Machine Learning Repository. After data cleaning, a total of 299 cases were taken as sample data for the subsequent research. The data included 11 characteristic factors such as gender, age, smoking status, diabetes status and hypertension status, and the survival or death of each patient during the observation period was used as the response variable. The research has two purposes: (1) to determine the important factors for different age stages, and (2) to determine the best prediction model for different age stages. In Chapter 2, the data source is described, a descriptive analysis and variable tests were carried out on the cardiovascular disease data, and three-fold cross-validation samples were constructed. In Chapter 3, decision tree and bagging methods were used to determine the important characteristics of patients with cardiovascular disease at different ages. Using three-fold cross-validation, a model was built on the training set and then used to predict the test set, and the prediction performance of the decision tree and bagging methods was compared. In Chapter 4, logistic regression was used to establish a model and determine the important characteristics; the established model was then used to predict the test set, and the accuracy of the model was obtained. In Chapter 5, the linear discriminant method was used to test the results obtained above. In Chapter 6, by comparing the
above models, the important features for predicting mortality risk in patients with cardiovascular disease at different ages were identified, and the best prediction model was obtained by comparing the prediction accuracy of these models on the test sets, providing a reference for medical diagnosis. The analysis results obtained in this thesis are as follows. For patients aged 40-49 years, more attention should be paid to ejection fraction, and the logistic regression model was selected because it had the highest prediction accuracy. For patients aged 50-59 years, the three variables of ejection fraction, serum creatinine and platelets need to be focused on, and the Fisher linear discriminant was selected because it had the highest prediction accuracy. For patients aged 60-79 years, more attention should be paid to the two variables of ejection fraction and serum creatinine, and the logistic regression model was selected because it had the highest prediction accuracy. Patients over the age of 79 are at high risk of cardiovascular disease.
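
The model comparison summarized above (fit each candidate on the training folds, predict the held-out fold, and keep the model with the highest mean test accuracy) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names are hypothetical, the twelve data points are invented for illustration only, and a one-feature decision stump and a majority-class baseline stand in for the decision tree, bagging, logistic regression and linear discriminant models actually compared.

```python
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float], int]
Fitter = Callable[[Sequence[float], Sequence[int]], Predictor]


def k_fold_indices(n: int, k: int = 3) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; fold i holds every k-th index starting at i."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, folds[i]))
    return splits


def cv_accuracy(X: Sequence[float], y: Sequence[int], fit: Fitter, k: int = 3) -> float:
    """Mean held-out accuracy over k folds; fit(X_train, y_train) returns a predictor."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        predict = fit([X[j] for j in train], [y[j] for j in train])
        correct = sum(predict(X[j]) == y[j] for j in test)
        scores.append(correct / len(test))
    return sum(scores) / k


def fit_majority(X_train: Sequence[float], y_train: Sequence[int]) -> Predictor:
    """Baseline: always predict the more frequent class of the training fold."""
    majority = 1 if 2 * sum(y_train) > len(y_train) else 0
    return lambda x: majority


def fit_stump(X_train: Sequence[float], y_train: Sequence[int]) -> Predictor:
    """Depth-1 decision tree on one feature: predict death (1) when x < threshold."""
    best_t, best_correct = None, -1
    for t in sorted(set(X_train)):
        correct = sum((1 if x < t else 0) == label for x, label in zip(X_train, y_train))
        if correct > best_correct:  # keep the smallest threshold with the best fit
            best_t, best_correct = t, correct
    return lambda x: 1 if x < best_t else 0


# Invented toy data (NOT from the thesis): ejection fraction (%) and death indicator.
ejection_fraction = [20, 25, 30, 35, 38, 40, 45, 50, 55, 60, 25, 60]
died = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc_stump = cv_accuracy(ejection_fraction, died, fit_stump)
acc_majority = cv_accuracy(ejection_fraction, died, fit_majority)
best_model = "stump" if acc_stump >= acc_majority else "majority"
print(f"stump: {acc_stump:.3f}, majority: {acc_majority:.3f}, selected: {best_model}")
```

In the study itself, this comparison would be repeated separately within each age group, with the actual fitted models supplied in place of the two stand-ins.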