| With the rapid development of medical research, the number of medical texts has increased dramatically. Clinical practice, medical literature, medical textbooks, electronic medical records, and other medical texts contain a wealth of medical knowledge. With the help of biomedical information extraction technology, medical entities and relations can be extracted from medical texts to construct a medical knowledge graph, which supports downstream tasks such as medical QA and auxiliary diagnosis. At present, research on entity and relation extraction from medical text focuses mainly on English medical literature and electronic medical records, and there are few public evaluations and corpora for Chinese medical texts. In addition, existing extraction research is mainly based on pipeline methods, which ignore the relationship between subtasks. To address these gaps, this paper studies the construction of Chinese medical information extraction datasets and joint extraction models of entities and relations. The main research work includes: (1) Construction and analysis of Chinese medical entity and relation extraction datasets. This paper collects texts from clinical practice, medical textbooks, electronic medical records, and other sources. It refers to authoritative medical standard terminology sets such as ICD-10, ATC, MeSH, and the medical insurance directory, as well as to entity and relation annotation systems and norms. It also establishes entity and relation annotation rules in line with the characteristics of the corpus, and completes pre-annotation and manual annotation with an annotation tool. The annotation consistency results indicate that the datasets are reliable. The Chinese Medical Information Extraction (CMeIE) dataset contains 28,008 sentences, 85,282 triples, 11 entity types, and 44 relation types, while the Chinese Electronic Medical Records Dataset of Diabetes and Stroke (CEMRDS) contains 6,192 sentences, 18,846 triples, 7 entity types, and 14 relation types. (2) Research and improvement of Chinese medical entity and relation 
extraction algorithms. This paper addresses the multi-relation extraction problem, and the entity overlap problem that arises within it, in the CMeIE and CEMRDS datasets. It proposes two joint entity and relation extraction algorithms: the subject-based Cascade binary tagging framework of Conditional Layer Normalization (Cas-CLN) model and the biaffine attention-based Cascade binary tagging and Multi-head Selection (CAMS) model. The subject-based Cas-CLN model uses a pre-trained model to encode the input sentences and a subject tagger to recognize subject entities. It then applies conditional layer normalization to integrate the subject embedding into the sentence encoding, and extracts the possible objects and relations with relation-specific object taggers. The biaffine attention-based CAMS model also uses a pre-trained model to extract input sentence features, and employs the binary tagging framework for named entity recognition. It then adds soft label embeddings to pass information between entity recognition and relation extraction, and applies biaffine attention to improve the multi-head selection module, which extracts the possible semantic relations between entity pairs. The CAMS-Bia-Syn model achieves competitive results on the CMeIE dataset, with a test set F1 score of 60.57%. The Cas-CLN model outperforms state-of-the-art models on the CEMRDS dataset, with a test set F1 score of 70.94%. Given that the number of relation types differs by a factor of 3.14 between the two datasets (44 versus 14), the difference in performance suggests that the Cas-CLN model may be better suited to datasets with a small number of relation types, while the CAMS-Bia-Syn model may be better suited to datasets with a large number of relation types. |
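
The conditional layer normalization step mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in Cas-CLN, so this follows one common variant, in which the gain and bias of layer normalization are linear functions of a condition vector (here, the subject embedding). The function names, the weight matrices `w_gamma` and `w_beta`, and the toy dimensions are all assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply a (hidden x cond) matrix by a condition vector of length cond."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def conditional_layer_norm(
    tokens: List[Vector],   # one hidden vector per token in the sentence
    subject: Vector,        # embedding of the recognized subject (the condition)
    w_gamma: Matrix,        # hypothetical projection: condition -> gain offset
    w_beta: Matrix,         # hypothetical projection: condition -> bias
    eps: float = 1e-12,
) -> List[Vector]:
    # Gain and bias depend on the subject. With all-zero projections the
    # gain is 1 and the bias is 0, which reduces to plain layer normalization.
    gamma = [1.0 + g for g in matvec(w_gamma, subject)]
    beta = matvec(w_beta, subject)

    out = []
    for h in tokens:
        mean = sum(h) / len(h)
        var = sum((x - mean) ** 2 for x in h) / len(h)
        std = math.sqrt(var + eps)
        # Normalize each token vector, then scale and shift it with the
        # subject-dependent gain and bias.
        out.append([g * (x - mean) / std + b for x, g, b in zip(h, gamma, beta)])
    return out


# Toy example: two tokens with hidden size 3, subject embedding of size 2.
tokens = [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]]
subject = [0.5, -0.5]
zero = [[0.0, 0.0] for _ in range(3)]
w_beta = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

plain = conditional_layer_norm(tokens, subject, zero, zero)
shifted = conditional_layer_norm(tokens, subject, zero, w_beta)
```

In this sketch every token representation is shifted by the same subject-dependent bias, so taggers applied afterwards see an encoding that depends on which subject was selected. In a trained model the two projections would be learned jointly with the encoder.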