| Cardiovascular disease is a major public health problem that endangers human health. It is characterized by an insidious onset, a long latency period, and difficulty of cure once it develops. Studies have shown that the occurrence of cardiovascular disease is closely related to its risk factors. Through early detection and prevention of these risk factors and the establishment of appropriate risk assessment models, cardiovascular disease can be effectively prevented and controlled. Many studies on cardiovascular disease risk assessment have been conducted domestically and internationally, but most are based on questionnaire data, literature data, and established risk factors, and therefore cannot fully capture the risk factors of cardiovascular disease. Most also rely on medical statistical methods, which have certain limitations. In recent years, with the continued development of medical informatization, medical information systems such as electronic medical records (EMR) have developed rapidly. EMR data contain not only patients’ test indicators but also a large amount of hidden, valuable information, providing a new data source for cardiovascular disease risk assessment research. In addition, as data mining algorithms have continued to improve, they have been increasingly applied to disease risk assessment, compensating for the shortcomings of models built with traditional statistical methods. The sound use of data mining algorithms to uncover latent rules and patterns in EMR data is therefore of great value for the early prevention and treatment of cardiovascular disease. This paper studies cardiovascular disease risk assessment models in depth using EMR data mining techniques, taking hypertensive patients as the research subjects. The main contents and results of this paper are as follows: (1) A series of preprocessing operations was carried out to address the inconsistency and incompleteness of the EMR data 
sets for hypertension, so as to provide clean and usable data for the data analysis algorithms. This preprocessing procedure may also serve as a reference for preprocessing EMR data on other diseases. (2) To address attribute redundancy and multicollinearity in the hypertension EMR data set, risk factors were screened using principal component analysis, a statistical method. Twenty main risk factors for hypertension were selected from more than 50 test items, which effectively reduced the complexity of model construction. (3) To address the limitations of statistical methods in model construction, a cardiovascular disease risk assessment model was built using the C5.0 decision tree algorithm. During model building, boosting was applied to improve the robustness of the model, and ten-fold cross-validation was used to improve its reliability. To avoid overfitting, the decision tree was also pruned. After pruning, the prediction accuracy of the model increased from 65.18% to 73.21%. |
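
The ten-fold cross-validation mentioned in item (3) can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation: the function names `k_fold_indices` and `cross_validate` are hypothetical, and a trivial majority-class classifier stands in for the boosted C5.0 tree, which is not reproduced here. The sketch assumes the data set has at least as many samples as folds.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, fit, predict, k=10, seed=0):
    """Return the mean held-out accuracy over k folds."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(len(labels)) if i not in held_out_set]
        # Fit on the other k-1 folds, then score on the held-out fold only.
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Hypothetical stand-in classifier (NOT C5.0): always predicts the most
# common label seen during training.
def fit_majority(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model
```

Any classifier exposing a comparable `fit`/`predict` pair could be passed to `cross_validate` in place of the majority-class stand-in. Each sample is scored exactly once, by a model that never saw it during training, which is what makes the averaged accuracy a more reliable estimate than a single train/test split.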