| Named entity recognition, one of the core tasks in natural language processing, involves identifying and classifying entities in text and plays a vital role in downstream work, but it also faces unique challenges. Current approaches still cannot effectively solve the out-of-vocabulary (OOV) problem and the nested entity problem encountered in Chinese named entity recognition. The usual solution to the OOV problem is to add an external vocabulary that covers OOV entities, but in practice no vocabulary can cover all possible words, and this approach leads to low model recall. For nested entities, ordinary sequence labeling models typically use dynamic programming to compute the single labeling path with the highest probability, so they cannot identify named entities with nested structures. In addition, traditional named entity recognition mainly processes single sentences rather than the full text, which makes context features sparse and seriously degrades recognition performance. This thesis proposes two models based on the powerful pre-trained language model BERT, which has played a significant role in advancing named entity recognition. The first is the Entity Regularity Perception model (ERP), a multi-task model for Chinese named entity recognition based on the perception of entity regularity. The second is the Multi-Word Fusion with Span Boundary Detection model (MwS), a Chinese nested named entity recognition model based on multi-word fusion and span boundary detection. The main contributions of this thesis are: (1) A multi-task Chinese named entity recognition model based on entity regularity perception is proposed. It identifies OOV entities by abandoning external vocabularies and instead exploring the internal regularity of entities. Based on the learned regularities, the model predicts the category of 
OOV entities. The model is built on the standard BERT embedding layer and consists of two tasks. One task uses the BiLSTM-CRF sequence labeling algorithm to detect whether a token subsequence is a named entity; this assists the other task, entity regularity perception, in predicting the entity category by analyzing the internal regularity of the entity. The two tasks are trained jointly and interact with each other. Ablation experiments and model comparisons demonstrate the effectiveness of this method in identifying OOV entities. (2) A feature enhancement method based on multi-word fusion is proposed. An analysis of the compositional structure of nested entities shows that each nested entity contains multiple entities, and each entity corresponds to a vocabulary word. Therefore, a method that fuses information from multiple words is proposed to enhance the features and improve the model's recognition of nested entities. The method uses BERT as its basis and builds embeddings at the character level. First, a dictionary is used to match multiple related words for each character in the target sequence, and these words form a word group. Then, in the designed fusion module, the words in each word group are fused according to their weights and finally fused again with the character vector obtained from the model's embedding layer, yielding the final word-character fusion vector. This vector is injected into the bottom layer of the BERT model and interacts fully with the multiple encoding layers. Applying multi-word fusion to named entity recognition effectively improves model performance, which shows the advantage of this method for the task. (3) A character-based span boundary detection method is proposed. Many current methods identify nested entities inefficiently and ignore entity boundary information. Therefore, by analyzing the span structure, this thesis 
proposes a character-based boundary detection method that uses two marker classifiers to predict the start and end positions of a span, that is, to predict whether a character is the first or the last character of a span. Delimiting span boundaries in this way reduces the generation of unnecessary spans and lightens the load on the model. Accurate boundary delimitation also improves the efficiency of subsequent span classification. In the experiments, the method achieved the best results in comparison with all baseline models. The two proposed models are evaluated on seven datasets. Compared with other models, ERP and MwS substantially improve recognition accuracy on named entity recognition tasks. The ERP model achieves satisfactory results in identifying OOV entities by capturing the internal regularities of entities. The MwS model has clear advantages in identifying nested entities, as it fuses multi-word information to enhance features and uses span boundary detection to delimit boundaries. |
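
The multi-word fusion step in contribution (2) can be illustrated with a minimal sketch. The abstract does not say how the word weights are computed or how the fused word vector is combined with the character vector. This sketch therefore assumes softmax weights over character-word dot products and simple vector addition. The function names `match_words` and `fuse`, the sample sentence, the toy lexicon, and the `max_word_len` parameter are all hypothetical and are not taken from the thesis.

```python
import math
from typing import List, Sequence, Set


def match_words(sentence: str, lexicon: Set[str], max_word_len: int = 4) -> List[List[str]]:
    """For each character, collect every lexicon word that covers it (its word group)."""
    groups: List[List[str]] = [[] for _ in sentence]
    n = len(sentence)
    for i in range(n):
        # Try every substring starting at i, up to max_word_len characters long.
        for j in range(i + 1, min(n, i + max_word_len) + 1):
            word = sentence[i:j]
            if word in lexicon:
                for k in range(i, j):
                    groups[k].append(word)
    return groups


def fuse(char_vec: Sequence[float], word_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Fuse a word group into one vector and add it to the character vector.

    Assumption: weights are a softmax over dot products between the character
    vector and each word vector; the thesis may define the weights differently.
    """
    if not word_vecs:
        return list(char_vec)
    scores = [sum(c * w for c, w in zip(char_vec, wv)) for wv in word_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        sum(wt * wv[d] for wt, wv in zip(weights, word_vecs))
        for d in range(len(char_vec))
    ]
    # Assumption: the final word-character fusion is element-wise addition.
    return [c + f for c, f in zip(char_vec, fused)]


# Illustrative sentence and toy lexicon (not from the thesis).
sentence = "南京市长江大桥"
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
groups = match_words(sentence, lexicon)
```

In this toy example the character 长 is matched to three overlapping words, which is the kind of multi-word evidence the fusion module would weigh before the result is passed on to the encoder layers.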
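
The span boundary detection idea in contribution (3) can be sketched as follows. The sketch assumes that two classifiers have already produced, for every character, a probability of being the first character of a span and a probability of being the last. The probabilities, the threshold of 0.5, the `max_len` option, and the function name `candidate_spans` are hypothetical choices for illustration, not details from the thesis.

```python
from typing import List, Optional, Sequence, Tuple


def candidate_spans(
    start_probs: Sequence[float],
    end_probs: Sequence[float],
    threshold: float = 0.5,
    max_len: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Pair predicted start characters with predicted end characters.

    Returns inclusive (start, end) index pairs. Only these candidates would be
    passed to a later span classifier, instead of every possible span.
    """
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [j for j, p in enumerate(end_probs) if p >= threshold]
    spans = []
    for i in starts:
        for j in ends:
            if j < i:
                continue  # a span cannot end before it starts
            if max_len is not None and j - i + 1 > max_len:
                continue  # optional cap on span length
            spans.append((i, j))
    return spans


# Made-up classifier outputs for the 7-character sentence "南京市长江大桥".
start_probs = [0.9, 0.1, 0.2, 0.8, 0.1, 0.3, 0.1]
end_probs = [0.1, 0.2, 0.9, 0.1, 0.7, 0.2, 0.9]
spans = candidate_spans(start_probs, end_probs)
total_spans = len(start_probs) * (len(start_probs) + 1) // 2
```

Here only 5 of the 28 possible spans survive as candidates, and overlapping pairs such as (3, 4) and (3, 6) are both kept, so nested entities remain representable. A spurious pair such as (0, 4) would be left for the span classifier to reject.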