Named entity recognition (NER) is a fundamental task in natural language processing that supports many important downstream tasks, such as knowledge graph construction and relation extraction. Named entities are an important part of the semantic information in text, and the accuracy with which they are recognized directly affects how well that information is understood. With the continuous development of information media, the Internet has become a vast repository of knowledge, filled with free text on every subject. Designing an effective NER algorithm that accurately identifies entities across varied scenarios is therefore a problem of practical significance. At present, deep learning-based NER models achieve good results on large-scale annotated corpora. However, in many real-world domains, such as medicine and microblogs, only extremely limited annotated corpora are available, and existing NER methods generally cannot achieve high accuracy when annotated data is insufficient. Designing a Chinese NER algorithm that works with a small amount of annotated data has therefore become a difficult problem in current research.

This thesis studies Chinese named entity recognition with limited annotated data. It investigates cross-domain transfer based on semantics and labels, and diversified in-domain data generation based on generative adversarial networks. The main contributions are as follows.

First, this thesis proposes a cross-domain sample transfer model based on semantics and labels, which uses cross-domain data to address the shortage of labeled data in a specific domain. The algorithm models the data distributions of different domains from two aspects: labels and semantics. During transfer learning with cross-domain data, the model transfers samples according to the differences between the data of the different domains. Experimental results show that the model can effectively transfer source-domain data based on semantic and label information and improve the generalization ability of the target-domain model.

Second, this thesis proposes a Chinese NER model based on data augmentation, which expands the limited annotated data within a domain. The model uses a generator to learn the in-domain data distribution, applies adversarial training against a discriminator to improve the quality of the generated data, and then uses the high-quality generated data to expand the training set. Experimental results show that the model can effectively simulate the distribution of real in-domain data, generate diversified in-domain data, and improve the generalization of the NER model.

Finally, this thesis proposes a Chinese NER model that combines transfer learning and data augmentation. The model uses the transferred cross-domain data together with the generated data to expand the training set of the target domain. Experimental results show that combining the two methods, both of which expand the dataset, further reduces the impact of insufficient annotated data on the performance of the NER model.