| In clinical medicine, accurately recognizing named entities in electronic medical records (EMRs) is important for building a complete medical knowledge base and accurate patient profiles. In Chinese named entity recognition, the word boundary problem of Chinese word segmentation and the diversity of Chinese expressions make it difficult to recognize named entities in Chinese electronic medical records accurately. Current deep learning methods for Chinese Medical Named Entity Recognition (CMNER) usually feed the embedding vectors of Chinese characters into the neural network, which avoids the noise that word segmentation introduces into entity recognition. However, this often ignores the rich semantic information at the lexical level. To solve this problem, this paper adds word-level embedding vectors based on the forward maximum matching (FMM) algorithm to the network to represent richer semantic and positional features. To address the word boundary difficulty in Chinese, this paper proposes two deep learning models based on the joint embedding of Chinese characters and words at different granularities. Building on the traditional BiLSTM-CRF model, a parallel embedding model and a mixed embedding model are proposed, which differ in how the word vectors are passed into the model. In addition, this paper compares the influence of three kinds of feature information on the deep learning model: the improved n-gram feature, the entity label combined position (TLCP), and the entity label not-combined position (TLNP). The experimental results show that joint embedding of characters and words at different granularities captures richer semantic and positional features and achieves better results on Task 2 of the China Conference on Knowledge Graph and Semantic Computing (CCKS 2017). The related results were presented at the IJCNN 2019 conference (CCF-C). Based on the characteristics of a data set that contains both labeled and unlabeled samples, active learning is applied to the parallel embedding model and the mixed embedding model, yielding a joint active learning and deep learning model. This paper obtained 1596 annotated and 10420 unannotated Chinese EMR texts from the competition data. To make full use of the unlabeled samples and effectively enlarge the training set, this paper proposes a pool-based active learning method that selects 200 representative unlabeled texts according to diversity and uncertainty. The samples selected by active learning are labeled manually through crowdsourcing, which yields 106 new words. The experimental results show that adding active learning to the parallel embedding model and the mixed embedding model yields both more accurate entity recognition and faster model convergence. In summary, this paper mainly studies the parallel embedding model and the mixed embedding model combined with active learning. Experiments show that the proposed method converges faster and achieves better results. In addition, the construction of the n-gram features and the design of the crowdsourcing labeling method are novel and have reference value for practical applications. |
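
To illustrate the word-level input mentioned in the abstract, forward maximum matching (FMM) can be sketched as below. This is a minimal sketch and not the paper's implementation: the function names `fmm_segment` and `char_word_pairs`, the toy lexicon, and the way each character is aligned with its containing word are all assumptions made for illustration.

```python
def fmm_segment(text, lexicon, max_len=None):
    """Forward maximum matching: at each position, greedily take the
    longest lexicon word; fall back to a single character otherwise."""
    if max_len is None:
        # Longest word in the lexicon bounds the search window.
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def char_word_pairs(text, lexicon):
    """Align every character with the FMM word that contains it and the
    character's offset inside that word (an assumed alignment scheme)."""
    pairs = []
    for word in fmm_segment(text, lexicon):
        for pos, ch in enumerate(word):
            pairs.append((ch, word, pos))
    return pairs


# Hypothetical toy lexicon, for illustration only.
lexicon = {"患者", "高血压", "头痛"}
print(fmm_segment("患者有高血压和头痛", lexicon))
# → ['患者', '有', '高血压', '和', '头痛']
```

In a joint embedding setup of the kind the abstract describes, each `(character, word, position)` triple could then be looked up in separate character and word embedding tables; how the two vectors are combined is where a parallel and a mixed design would differ.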