| Electronic medical records (EMRs) have become an indispensable part of the work of medical institutions. They contain important information such as clinical findings, diagnoses, and drug prescriptions. This information has been applied to clinical natural language processing (NLP) studies such as clinical decision-making, mortality prediction, and adverse drug reaction analysis. However, different medical institutions follow different standards for writing EMRs. Clinical term normalization can improve the sharing of clinical information between institutions and the interoperability between application platforms in the clinical field; it can also improve data quality and help optimize machine learning models built on EMR data. The research in this thesis is based on the clinical term normalization task released by the National NLP Clinical Challenges (n2c2) in 2019. The task requires mapping clinical terms in EMRs to concept unique identifiers (CUIs) in the Unified Medical Language System (UMLS), where every CUI has several description strings. This thesis focuses on two problems: the scarcity of corpora for clinical term normalization, and the difficulty existing normalization methods have with terms that differ in word form but share the same meaning. The research contents are as follows: (1) This thesis proposes a method that transfers word features from a language model pre-trained on clinical-domain text into a Siamese recurrent neural network. Traditional approaches combine feature engineering with machine learning to avoid the need for a large-scale corpus, but they require the feature extraction method to be defined manually. A Siamese network, which processes paired inputs with the same sub-network, is well suited to computing semantic similarity. It performs well on small-scale corpora, but it has not been applied to clinical term normalization. In this thesis, the word features of a clinical-domain pre-trained model are embedded into the Siamese recurrent
neural network as initial word vectors to normalize clinical terms. Comparative experiments cover several pre-trained language models and several recurrent neural network variants, and compare the approach against the widely used term normalization system MetaMap; the results demonstrate the effectiveness of the method on a small-scale annotated corpus. (2) This thesis proposes a cross-lingual text method for calculating similarity. Because UMLS is very large, generating a candidate set of CUI description strings is necessary. Traditional candidate generation is based on morphological variants and shared words, which cannot handle terms that differ in word form but share the same meaning. The proposed method compares the semantics of texts in the original language by comparing the semantics of their counterparts in other languages. It can handle not only synonyms but also added or deleted words and changes in sentence structure and word order. In this thesis, character-based and term frequency-inverse document frequency (TF-IDF) based methods are first applied to generate candidate sets, and cross-lingual text similarity is then used to supplement or update the candidates. Comparison experiments show that this method effectively improves both the recall of the candidate set and the accuracy of normalization. |
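
The TF-IDF candidate generation step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the implementation, so everything below is an assumption made for illustration: the function names (`char_ngrams`, `build_index`, `generate_candidates`), the use of character trigrams, the smoothed IDF formula, and the toy description strings. The CUI labels (`C_MI`, `C_HTN`, `C_DM`) are hypothetical placeholders, not real UMLS identifiers.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalise(vec):
    """Scale a sparse vector to unit L2 length (empty vectors are returned as-is)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else vec


def build_index(descriptions, n=3):
    """Compute IDF weights and unit-length TF-IDF vectors for (cui, string) pairs."""
    docs = [Counter(char_ngrams(s, n)) for _, s in descriptions]
    # Document frequency: iterating a Counter yields each distinct n-gram once.
    df = Counter(g for doc in docs for g in doc)
    total = len(docs)
    # Smoothed IDF (an assumption; the source does not state a formula).
    idf = {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}
    vectors = [normalise({g: tf * idf[g] for g, tf in doc.items()}) for doc in docs]
    return idf, vectors


def generate_candidates(mention, descriptions, idf, vectors, k=3, n=3):
    """Rank CUIs by the best cosine similarity between the mention and their strings."""
    counts = Counter(char_ngrams(mention, n))
    query = normalise({g: tf * idf[g] for g, tf in counts.items() if g in idf})
    best = {}
    for (cui, _), vec in zip(descriptions, vectors):
        score = sum(w * vec.get(g, 0.0) for g, w in query.items())
        best[cui] = max(best.get(cui, 0.0), score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy description strings; CUI labels are hypothetical placeholders.
DESCRIPTIONS = [
    ("C_MI", "myocardial infarction"),
    ("C_MI", "heart attack"),
    ("C_HTN", "hypertension"),
    ("C_HTN", "high blood pressure"),
    ("C_DM", "diabetes mellitus"),
]

IDF, VECTORS = build_index(DESCRIPTIONS)
candidates = generate_candidates("myocardial infarct", DESCRIPTIONS, IDF, VECTORS)
exact = generate_candidates("heart attack", DESCRIPTIONS, IDF, VECTORS)
print(candidates)
```

A surface-form ranker of this kind scores a mention only against strings that share characters with it, which is the limitation the abstract attributes to traditional candidate generation: "heart attack" and "myocardial infarction" are linked here only because both strings happen to be listed under the same placeholder CUI. The cross-lingual similarity step is what the thesis proposes to supplement or update such a candidate set.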