| Named entity recognition (NER) is a fundamental task in natural language processing. It is widely used in applications such as text classification, recommendation systems, and information retrieval, and its accuracy directly affects the performance of these downstream tasks. Xinjiang local medicine named entity recognition refers to identifying medical named entities related to Xinjiang local medicine in large volumes of web text. It is a prerequisite for higher-level applications such as intelligent medicine, medical knowledge mining, and clinical decision support systems, and it has high research value in both theory and practice. Most medical named entity recognition techniques in China are based on traditional deep learning methods, and no established method exists for Xinjiang local medicine, which limits research in this area. To address these problems, this paper draws on the theory of pre-trained models and studies two pre-training approaches, auto-encoding and auto-regressive, to achieve accurate recognition of Xinjiang local medicine named entities. The specific work is as follows: (1) Because no Xinjiang local medicine corpus is publicly available in China, this paper uses web crawlers to collect relevant medical text from medical websites and from social platforms such as Weibo, Tieba, and Zhihu. The resulting Xinjiang local medicine corpus is corrected and annotated under expert guidance, laying the foundation for further research on and analysis of Xinjiang local medicine texts. (2) This paper proposes a named entity recognition algorithm for Xinjiang local medicine, BERT-BiLSTM-CRF, which combines a bidirectional long short-term memory network (BiLSTM) with an auto-encoding pre-training method. The method pre-trains a bidirectional Transformer with random masking (MASK) and dynamically generates semantic vectors from the sequence information of the characters, which enhances their semantic representation. The character vector sequence is then used as input to the BiLSTM layer, whose output retains long-distance information for tagging. Finally, the CRF module scores all candidate tag sequences, normalizes these scores into conditional probabilities, and takes the dependencies between tags into account to output the globally optimal tag sequence. This method achieves a precision of 95.77%, a recall of 89.47%, and an F1 value of 92.52% on Xinjiang local medicine named entity recognition. (3) To overcome the inability of traditional autoregressive models to use left and right context at the same time, a Xinjiang local medicine named entity recognition method based on the autoregressive pre-trained model XLNet is proposed. All permutations of a sequence are considered during modeling, so that each position can use information from all other positions. This integrates context from both directions and strengthens the model’s ability to recognize Xinjiang local medicine named entities. This method achieves a precision of 96.52%, a recall of 92.63%, and an F1 value of 94.53% on Xinjiang local medicine named entity recognition. |
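
The permutation idea in item (3) can be illustrated with a small sketch. It is not the authors' implementation and does not use XLNet itself. It only shows, for a hypothetical four-token sequence, which positions each token may see under a given prediction order. The function names `permutation_mask` and `visible_context` and the example orders are assumptions made for illustration.

```python
def permutation_mask(order):
    """Build a visibility mask for one prediction order.

    order[k] is the position predicted at step k (an illustrative
    convention). mask[i][j] is True when position i may use position j,
    i.e. when j is predicted earlier than i in this order.
    """
    n = len(order)
    rank = {pos: step for step, pos in enumerate(order)}
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


def visible_context(order):
    """Map each position to the sorted list of positions it may use."""
    mask = permutation_mask(order)
    return {
        i: [j for j, allowed in enumerate(mask[i]) if allowed]
        for i in range(len(order))
    }


# Plain left-to-right order: every position sees only its left context.
print(visible_context([0, 1, 2, 3]))

# A shuffled order (hypothetical): position 1 is predicted last, so it
# may use position 0 on its left and positions 2 and 3 on its right.
print(visible_context([2, 0, 3, 1]))
```

Under the left-to-right order no position ever sees a token to its right, which is the limitation the abstract describes. Under the shuffled order, position 1 draws on both sides, and across many sampled orders every position eventually sees every other one.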
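
As a quick consistency check on the reported results, the F1 value can be recomputed as the harmonic mean of the other two figures. The sketch below assumes that the first figure in each result is precision (the abstract reports it alongside recall and F1). The model labels in the dictionary are shorthand for the two methods in items (2) and (3), and `f1_score` is a helper defined here, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as stated in the abstract.
reported = {
    "BERT-BiLSTM-CRF": (95.77, 89.47, 92.52),
    "XLNet": (96.52, 92.63, 94.53),
}

for name, (p, r, f1) in reported.items():
    recomputed = f1_score(p, r)
    print(f"{name}: reported F1 = {f1:.2f}, recomputed F1 = {recomputed:.2f}")
```

The recomputed values agree with the reported ones to within about 0.01 percentage points, which is consistent with the inputs having been rounded to two decimals.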