| The International Classification of Diseases (ICD) is a standard for coding diseases maintained by the World Health Organization. It is widely used to support medical reimbursement and to report on patients’ health. Under the relevant national regulations, the front page of every hospital patient’s electronic medical record must be coded according to the ICD standard. Because manual coding is slow and error-prone, automating ICD coding has long been a focus of research. Work on automatic ICD coding is based mainly on either the text of the medical record or the text of the discharge diagnosis; this thesis belongs to the latter category. At present, there are two main ways of modeling the ICD auto-coding problem on discharge diagnosis text: text-similarity-based methods and text classification methods. Both face significant problems in practice: the ICD label set they build is incomplete, so they cannot code every disease; the distribution of ICD labels across samples is imbalanced, so coding accuracy for rare diseases is low; and doctors’ varied writing and diagnostic styles degrade overall model performance. To solve these problems, this thesis proposes an FPK-Seq2Seq (Fusion Priori Knowledge Based on Seq2Seq) model, with the following specific contributions: (1) The ICD auto-coding problem is modeled as the task of "translating" the doctor’s diagnosis into the standard disease diagnosis name, which makes effective use of the ICD standard diagnosis information and supports coding of all disease types. Experimental results show that this modeling approach improves significantly on the traditional modeling methods on every evaluation metric. (2) Because the generation process of the basic Seq2Seq model is uncontrollable, the FPK-Seq2Seq model is proposed under this modeling approach. In the decoding stage, the model actively learns and uses prior knowledge to guide the generation of the target sequence and produce more reasonable text. An algorithm based on semantic similarity is used in the output layer of the model to improve ICD coding accuracy. The experimental results show that the FPK-Seq2Seq model effectively alleviates the impact of data imbalance on the coding accuracy of rare diseases, with a mean macro F1 of 0.7388; it also achieves good results on the other metrics. In addition, the FPK-Seq2Seq model shows some reasoning ability: it correctly codes some ICD labels that never appeared in the training set. (3) An ICD automatic coding system is designed and implemented, comprising data preprocessing, model management, and ICD automatic coding modules. The system can update the model regularly to keep improving its performance. In summary, this thesis adopts a new "translation"-based modeling idea and proposes the FPK-Seq2Seq model, which integrates prior knowledge, for the diagnosis-based ICD automatic coding problem. The experimental results show that the model has good performance, generalization ability, and scalability, and an ICD automatic coding system is designed and implemented on the basis of this model. |
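
The output-layer step mentioned in contribution (2), where generated text is matched back to a standard ICD diagnosis name, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract specifies a semantic-similarity measure but gives no details, so the sketch substitutes a character-bigram Dice coefficient as a stand-in. The catalogue `ICD_CATALOGUE`, the helper names `bigrams`, `dice`, and `snap_to_catalogue`, and the three sample entries are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Toy catalogue: standard diagnosis name -> ICD code.
# Illustrative entries only; a real system would load the full ICD table.
ICD_CATALOGUE = {
    "type 2 diabetes mellitus without complications": "E11.9",
    "essential (primary) hypertension": "I10",
    "pneumonia, unspecified": "J18.9",
}


def bigrams(text):
    """Return a multiset of character bigrams of the lower-cased text."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Dice coefficient over character bigrams (a stand-in for the
    semantic similarity used in the thesis, which is not specified here)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2 * overlap / total


def snap_to_catalogue(generated, catalogue=ICD_CATALOGUE):
    """Map decoder output to the most similar standard name and its code."""
    best = max(catalogue, key=lambda name: dice(generated, name))
    return best, catalogue[best], dice(generated, best)


if __name__ == "__main__":
    # A decoder output that is close to, but not exactly, a standard name.
    print(snap_to_catalogue("primary hypertension"))
```

Because the final answer is always drawn from the catalogue, a near-miss generation still resolves to a valid code, which is one plausible reason such a step can help with labels that are rare or absent in the training set.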