With the spread of electronic medical records, ICD coding has become an active research topic in natural language processing. This thesis studies the content of electronic medical records and proposes two models: SHAN for the ICD coding task and RESS for the sentence sampling task.

Existing ICD coding models are generally black-box networks that cannot give reasons for their classification results. Moreover, they almost always use only a single part of the data in a medical record and ignore other information that would help classification. To address these problems, this thesis proposes the SHAN model, which combines the disease description with the doctor's written diagnosis. The written diagnosis serves as the basis for attention allocation, and a hierarchical structure is used to complete the ICD classification task with higher performance. At the same time, sentences in the disease description that are more relevant to a specific diagnosis receive larger attention weights, and these weights are used to show the reason for each classification. In comparative experiments, SHAN shows excellent performance on the MIMIC dataset and a Chinese dataset while providing effective interpretability for its ICD coding results.

While studying SHAN, we found that too many sentences in the disease description exceed the functional limits of the perception units in the classifier and occupy a large amount of storage space. To address this, the thesis proposes the reinforcement sentence sampler (RESS) model, which draws on the random exploration idea of reinforcement learning. It uses the change in classifier performance as the guiding signal to train a model that judges the importance of each sentence in the classification process. Experiments verify that RESS effectively reduces the number of useless sentences in the disease description while keeping the loss of classification performance caused by having fewer sentences as small as possible.

In summary, validated by deep learning experiments, this thesis proposes the SHAN model, which uses hierarchical attention to address incomplete data usage and the black-box problem, and which provides relevant sentences as an interpretable basis while completing the ICD coding task. For the problem of too many sentences in the disease description, it proposes the RESS model, which uses reinforcement learning to judge the importance of each sentence without manual labeling and effectively removes redundant, useless sentences.