Background
The prevalence of diabetes mellitus and its complications is gradually increasing, imposing a far-reaching social and economic burden. The treatment of diabetic patients has generated a large volume of longitudinal electronic medical record (EMR) data, and the narrative portions of EMRs contain much important information for the prevention and control of diabetes and its complications, such as hypoglycemic events and adverse drug events (ADE). Previously, collecting such unstructured information required time-consuming and labor-intensive manual retrieval. Natural language processing (NLP) techniques can be used to extract the needed textual information quickly and accurately by computer. However, there are many difficulties in applying NLP to Chinese EMRs, and research on NLP techniques for extracting unstructured key information from Chinese diabetes EMRs is still lacking.

Objective
(1) To construct a high-quality manually annotated corpus based on Chinese diabetes EMRs, including entities, entity modifications, and entity relationships; (2) to extract useful unstructured information from EMRs based on the annotated corpus using dictionary and rule-based methods, and to conduct a preliminary exploration of deep learning models combined with medical knowledge for extracting ADE information.

Methods
EMRs of inpatients with diabetes were collected from a hospital endocrinology department; the data were de-identified and the text was pre-processed.

1. Construction of an annotated corpus for Chinese EMRs. Chinese EMR annotation guidelines were developed based on the linguistic characteristics of Chinese EMRs and the research content. A web-based annotation platform was developed. The required sentences were selected from the processed medical record text and imported into the annotation platform as the corpus source. Two annotators were trained to perform pre-annotation according to the annotation guidelines. For each round of pre-annotation, the consistency between annotators was calculated. The doubts and 
inconsistencies that emerged during the pre-annotation process were discussed with medical experts, and the annotation guidelines were updated accordingly. The corpus was formally constructed once the consistency between annotators reached 90% or more.

2. Extraction of unstructured information from EMRs using NLP techniques. Chinese EMR-related dictionaries were established based on professional knowledge, including a dictionary of endocrine drug names, a dictionary of drug dosage forms and delivery methods, and a dictionary of adverse drug reactions in Chinese EMRs. Different rule patterns were designed according to the characteristics of each type of unstructured data. A dictionary and rule-based approach, built on the annotated corpus, was used to extract unstructured information from diabetes EMRs, including acanthosis nigricans, peripheral neuropathy examination results, the course of diabetes and its complications, hypoglycemic events, and adverse drug reactions. To further explore ADE information extraction from EMRs, a deep learning model integrating medical knowledge was developed, using the annotated corpus as the training set. The model was a BiLSTM-CRF. Medical knowledge was derived from the previously established rules and dictionaries and was added to the model through rule-based post-processing. A portion of the annotated corpus was used as a training set to help refine the dictionaries and rules, and a portion was used as a test set to evaluate the information extraction efficacy of the model. Using the annotated corpus as the gold standard, the information extraction performance of the NLP system was evaluated by calculating recall, precision, and F1 value.

Results
1. Construction of an annotated corpus for Chinese EMRs. The annotated corpus includes entities, entity modifications, and entity relationships. We calculated the inter-annotator agreement (IAA): over four rounds of pre-annotation, the IAA of entity annotation improved from 85.89% to 93.72%, and the IAA of relations improved from 65.74% to 92.65%. The average IAA 
of named entity annotation reached 89.38%, and the average IAA of entity relations reached 82.98%. The annotated corpus has high reliability and can be considered a high-quality manually annotated corpus. The constructed corpus contains 4417 sentences and a total of 3067 annotated entities, of which 509 are drugs and 1423 are clinical manifestations. It includes 1285 positive relationships, 1339 negative relationships, and 93 possible relationships.

2. Extraction of unstructured information from EMRs using NLP techniques. The dictionary and rule-based approach can accurately extract unstructured information: its recall, precision, and F1 values for the target unstructured information in the test set were all above 85%. The BiLSTM-CRF model can effectively extract ADE information from clinical texts. After medical knowledge was integrated, information extraction performance for positive relationships improved substantially, with an F1 value of 70%. In this study, the dictionary and rule-based methods extracted ADE information better than the deep learning model did, but both methods can effectively extract unstructured information from Chinese EMRs.

Conclusions
1. In this study, an annotated corpus of Chinese diabetes EMRs was established. The corpus construction process and the annotation guidelines lay a foundation for other related work on Chinese EMRs. The annotated corpus plays a key role in evaluating the performance of NLP systems and in training NLP models.

2. NLP technology can effectively extract unstructured key information. In this study, the dictionary and rule-based methods demonstrated excellent performance, and the performance of the deep learning model also improved after professional knowledge was integrated, indicating that medical knowledge plays an important role in 
information extraction in the medical field.

3. Relevant unstructured information in EMRs can help identify key factors in disease prevention and control, and can assist physicians in clinical decision-making and patient management. NLP technology provides reliable support for mining and utilizing unstructured text information in Chinese EMRs.
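
The evaluation step described in the Methods (recall, precision, and F1 value computed against the annotated corpus as gold standard) can be illustrated with a minimal sketch. It assumes exact matching on hypothetical `(sentence_id, start, end, label)` tuples. The abstract does not state the matching criterion or data format actually used, so the function name, tuple layout, and sample annotations below are illustrative assumptions only.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical annotation format: (sentence_id, start_offset, end_offset, label).
Annotation = Tuple[int, int, int, str]


def precision_recall_f1(
    gold: Iterable[Annotation], predicted: Iterable[Annotation]
) -> Dict[str, float]:
    """Score system output against gold annotations using exact matching."""
    gold_set = set(gold)
    pred_set = set(predicted)

    # A prediction counts as a true positive only if it matches a gold
    # annotation exactly (same sentence, same span, same label).
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up sample data, not taken from the study.
    gold_annotations = [
        (1, 0, 3, "Drug"),
        (1, 5, 8, "ADE"),
        (2, 0, 2, "Drug"),
        (2, 4, 7, "ADE"),
    ]
    system_output = [
        (1, 0, 3, "Drug"),
        (1, 5, 8, "ADE"),
        (2, 0, 2, "Drug"),
    ]
    scores = precision_recall_f1(gold_annotations, system_output)
    for name, value in scores.items():
        print(f"{name}: {value:.4f}")
```

Exact matching is the strictest choice; a relaxed (overlap-based) criterion would give higher scores on the same output, so the matching rule should be reported alongside any results.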