| Named entity recognition is widely used in tasks such as information extraction, knowledge graph construction, and question answering, and is one of the basic tasks in natural language processing research. A substitution reaction is a reaction in which atoms or atomic groups in a compound or organic molecule are replaced by other atoms or atomic groups of the same type from a reagent; it is one of the most important classes of reactions in chemistry. Substitution reaction named entity recognition is the identification of named entities carrying substitution reaction information in large volumes of substitution reaction text. It is a prerequisite for practical applications such as recommendation systems for substitution reaction information and the construction of knowledge graphs in this field, and therefore has high research value. Domestic research on named entity recognition has been carried out in many fields, but no established method exists for the field of substitution reactions, and labeled training data sets are also lacking. In response to these problems, this thesis draws on the text characteristics of the substitution reaction field to study named entity recognition in that field. The main research contents are as follows: (1) Because no relevant data set exists for the field of substitution reactions, this thesis constructs a named entity recognition data set by collecting patents related to substitution reactions from the chemical database of the Chinese Academy of Sciences. According to the characteristics of named entities in texts in this field, substitution reaction entities are divided into seven types: reactant, product, time, yield, solvent, temperature, and instrument. The texts are then labeled according to formulated labeling specifications, completing the construction of the data set. (2) Aiming at the 
ambiguous boundaries of entities in substitution reaction texts, a named entity recognition model for substitution reactions based on BERT-BiLSTM-CRF is proposed and studied. The model uses the BERT pre-trained language model to adjust the word vector representation according to context and thereby enrich the semantic information of the text. A BiLSTM network extracts features from the word vectors, and a CRF layer then outputs the tag sequence with the highest probability, which addresses the blurred entity boundaries and low accuracy in substitution reaction text. Comparative experiments show that the model outperforms mainstream models such as BiLSTM-CRF and BERT-BiLSTM. (3) Aiming at the polysemy of reactant and product named entities, a named entity recognition model based on RoBERTa-wwm-ext-BiGRU-CRF is proposed. The model obtains dynamic word vectors from the RoBERTa-wwm-ext model, captures longer-distance contextual features in the text with a BiGRU, and produces its final output under the constraints of a CRF layer. Comparative experiments show that the model effectively handles the polysemy of these entities. |
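
The final CRF step, which outputs the highest-scoring tag sequence, can be illustrated with a small Viterbi decoder over the seven entity types. This is only a sketch under stated assumptions, not the thesis's implementation: the abstract does not name a tagging scheme, so BIO tags are assumed here; the transition scores are hand-written hard constraints rather than learned CRF parameters; and the per-token emission scores, which in the described models would come from the BiLSTM or BiGRU layer, are supplied by hand. All names (`ENTITY_TYPES`, `build_bio_labels`, `transition_score`, `viterbi_decode`) are hypothetical.

```python
from typing import List, Sequence

# The seven entity types named in the abstract; the BIO scheme is an assumption.
ENTITY_TYPES = ["REACTANT", "PRODUCT", "TIME", "YIELD",
                "SOLVENT", "TEMPERATURE", "INSTRUMENT"]
NEG_INF = -1e9  # large negative score standing in for "forbidden"


def build_bio_labels(entity_types: Sequence[str]) -> List[str]:
    """Return ["O", "B-X", "I-X", ...] for every entity type X."""
    labels = ["O"]
    for etype in entity_types:
        labels += [f"B-{etype}", f"I-{etype}"]
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)


def transition_score(prev: str, curr: str) -> float:
    """Hard BIO constraint: I-X may only follow B-X or I-X (not learned)."""
    if curr.startswith("I-"):
        etype = curr[2:]
        return 0.0 if prev in (f"B-{etype}", f"I-{etype}") else NEG_INF
    return 0.0


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   labels: Sequence[str]) -> List[str]:
    """Return the label sequence maximizing emission + transition scores."""
    if not emissions:
        return []
    n = len(labels)
    # A sequence may not start inside an entity, so I- tags are forbidden at t=0.
    score = [emissions[0][j] + (NEG_INF if labels[j].startswith("I-") else 0.0)
             for j in range(n)]
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n):
            # Best previous label for current label j.
            best_i = max(range(n),
                         key=lambda i: score[i] + transition_score(labels[i], labels[j]))
            new_score.append(score[best_i]
                             + transition_score(labels[best_i], labels[j])
                             + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(range(n), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[i] for i in path]
```

When a token's strongest emission is an `I-` tag but the preceding token slightly prefers `O`, per-token argmax would yield an invalid sequence, whereas the constrained decoder moves the preceding tag to the matching `B-` tag. This is the sense in which a CRF layer helps with blurred entity boundaries.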