Coronary heart disease is a major disease that endangers human health. Patients' electronic medical records contain many descriptions of risk factors such as hypertension and diabetes, and extracting these descriptions accurately is of great significance for clinical research and computer-aided clinical diagnosis. Much work has been done on extracting coronary heart disease risk factors from English electronic medical records, but research on Chinese electronic medical records is relatively rare. It is therefore necessary to study the extraction of coronary heart disease risk factors from Chinese electronic medical records. This paper combines a variety of natural language processing techniques to study extraction methods for coronary heart disease risk factors on the basis of a newly constructed corpus, and to provide a reference for clinical experiments. The main contributions of this paper are as follows:

(1) An annotation guideline for a coronary heart disease risk factor corpus of Chinese electronic medical records was developed, and the corpus was constructed. The corpus is based on the pretreatment summaries of 500 patients with coronary heart disease provided by a Grade-A tertiary hospital in Xinjiang. The annotation guideline and a risk factor corpus annotation tool were developed with reference to the coronary heart disease risk factor corpus published by i2b2, the American center for clinical informatics, in 2014. Pre-annotation and formal annotation were carried out by two clinicians. After three rounds of pre-annotation and one round of formal annotation, the inter-annotator agreement (IAA) reached 0.95, which indicates that the annotation is reliable.

(2) A hybrid method for extracting coronary heart disease risk factors is proposed. To address the imbalance in the distribution of risk factor mentions in the constructed corpus, extraction combines rule-based and machine learning methods. For risk factors with many annotated mentions, a conditional random field (CRF) is combined with a bidirectional long short-term memory network (Bi-LSTM) to form the extraction model; for risk factors with few annotated mentions, a rule-based method is used. Extracting the two groups separately helps to overcome the poor generalization and over-fitting caused by the imbalance of the descriptions. Experiments show that the F-value of the hybrid extraction method is 0.882, which is higher than that of either single method and of extraction without grouping.

(3) To further improve accuracy, an improved multi-task Bi-LSTM-CRF extraction method is proposed for the risk factors with many annotated mentions. The input is represented with word vectors, and the extraction task is trained jointly with a word segmentation task. During extraction, the word boundary information obtained from word segmentation is shared, which provides additional features for extraction. A Bi-LSTM-CRF model is used for both tasks. The experimental results show an F-value of 0.885, an improvement over the 0.865 obtained with the Bi-LSTM-CRF model alone.
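
The grouping idea in contribution (2) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the threshold `MIN_MENTIONS_FOR_MODEL`, the `RULES` table, the English example text, and the `model_extract` callback (standing in for a trained Bi-LSTM-CRF tagger) are all hypothetical names introduced here for illustration.

```python
import re
from collections import Counter

# Hypothetical threshold: categories with at least this many annotated
# mentions are routed to the learned model, the rest to hand-written rules.
MIN_MENTIONS_FOR_MODEL = 3

# Hypothetical rule table for low-frequency categories (illustrative only).
RULES = {
    "smoking": re.compile(r"smok(?:es|ed|ing|er)", re.IGNORECASE),
}


def split_categories(mentions, threshold=MIN_MENTIONS_FOR_MODEL):
    """Split risk-factor categories into frequent and rare groups.

    `mentions` is a list of (text, category) pairs from an annotated corpus.
    """
    counts = Counter(category for _, category in mentions)
    frequent = {c for c, n in counts.items() if n >= threshold}
    rare = set(counts) - frequent
    return frequent, rare


def extract_with_rules(text, rare):
    """Return (start, end, category) spans found by the rules for rare categories."""
    found = []
    for category in sorted(rare):
        pattern = RULES.get(category)
        if pattern is None:
            continue  # no rule written for this category
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), category))
    return found


def hybrid_extract(text, frequent, rare, model_extract):
    """Merge model output for frequent categories with rule output for rare ones.

    `model_extract(text, frequent)` is an assumed callback that returns
    (start, end, category) spans, e.g. from a trained sequence tagger.
    """
    spans = model_extract(text, frequent) + extract_with_rules(text, rare)
    return sorted(spans)
```

In this sketch the frequency split is computed once from the annotated corpus, so the learned model is trained and evaluated only on categories with enough examples, while sparse categories never enter its label set.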