
Research And Implementation Of Pseudo Sample Generation Algorithm In Chinese Named Entity Recognition

Posted on: 2021-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Zheng
GTID: 2568306905975459
Subject: Software engineering
Abstract/Summary:
The purpose of named entity recognition is to identify meaningful entities, such as person names, location names, and organization names, in a given text. It is a basic and important task in natural language processing and an important step in many downstream tasks, such as machine translation, relation extraction, and automatic question answering, so it has considerable application value. Deep learning is widely used in named entity recognition because it can learn task-specific feature representations automatically, and it has brought great progress.

However, some problems remain in Chinese named entity recognition: (1) Labeled data is lacking in many domains, and labeling enough training data for a deep learning model is very time-consuming and expensive. (2) Chinese named entity recognition is usually modeled as a character-level sequence labeling problem, because a Chinese sentence is one long string with no explicit separator, such as a space, to divide it into words. These particularities and complexities of the Chinese language pose great challenges for named entity recognition research. This thesis therefore explores and studies these problems.

The main research work of this thesis is as follows:

(1) To address the shortage of labeled training data in Chinese named entity recognition, two named entity pseudo-sample generation algorithms are proposed: a random pseudo-sample generation algorithm and a loss-based pseudo-sample generation algorithm. Both algorithms share one assumption: entities in the same category can replace each other. Under this assumption, the proposed algorithms obtain more pseudo training data by reusing the named entities of every category found in the existing labeled data. Comparative experiments on the currently popular character-embedding-based BiLSTM-CRF model verify the effectiveness of this method for Chinese named entity recognition.

(2) To alleviate the problems that the particularity and complexity of Chinese bring to named entity recognition, this thesis proposes a BiLSTM-Self-Att-CRF model with word vectors. First, all potential words that match a dictionary are added as auxiliary features to the character-embedding-based BiLSTM-CRF model, so that the model can use word information without being affected by segmentation errors. Second, a multi-head self-attention mechanism is introduced to learn global dependency information across the whole sequence. Experimental results and analysis show that the proposed model achieves better results. Finally, the loss-based named entity pseudo-sample generation algorithm is integrated into the model. Experimental comparison shows that it effectively improves named entity recognition performance and outperforms other strong models.

(3) This thesis designs and implements a named entity recognition system for news text. The system is built on the Python-based Flask framework and API technology. It deploys the named entity recognition model, provides entity recognition on the Web, and offers a data annotation function for expanding the training samples.
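The replacement assumption behind contribution (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the algorithm's details, so the BIO tagging scheme, the toy corpus, and the helper names `extract_spans`, `build_entity_pool`, and `generate_pseudo_sample` are all assumptions made for illustration. Only the random variant is sketched; the loss-based variant would need a trained model to guide the choice of replacement and is not shown.

```python
import random
from collections import defaultdict


def extract_spans(tags):
    """Return (start, end, type) entity spans from a BIO tag list; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        # Close the open span unless the current tag continues it.
        if start is not None and tag != f"I-{etype}":
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


def build_entity_pool(corpus):
    """Collect every labeled entity string in the corpus, grouped by category."""
    pool = defaultdict(set)
    for chars, tags in corpus:
        for start, end, etype in extract_spans(tags):
            pool[etype].add("".join(chars[start:end]))
    # Sorted lists keep random choices reproducible for a fixed seed.
    return {etype: sorted(names) for etype, names in pool.items()}


def generate_pseudo_sample(chars, tags, pool, rng):
    """Replace each entity with a randomly chosen entity of the same category."""
    new_chars, new_tags, cursor = [], [], 0
    for start, end, etype in extract_spans(tags):
        # Copy the non-entity text before this entity unchanged.
        new_chars.extend(chars[cursor:start])
        new_tags.extend(tags[cursor:start])
        replacement = rng.choice(pool[etype])
        new_chars.extend(replacement)
        # Re-tag the replacement at character level: one B- tag, then I- tags.
        new_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(replacement) - 1))
        cursor = end
    new_chars.extend(chars[cursor:])
    new_tags.extend(tags[cursor:])
    return new_chars, new_tags


# Hypothetical toy corpus: (characters, BIO tags) pairs.
corpus = [
    (list("张三在北京工作"), ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]),
    (list("李四去了上海"), ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]),
]

pool = build_entity_pool(corpus)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
pseudo_chars, pseudo_tags = generate_pseudo_sample(corpus[0][0], corpus[0][1], pool, rng)
print("".join(pseudo_chars), pseudo_tags)
```

Because the replacement is re-tagged character by character, the pseudo sample stays aligned even when the substituted entity has a different length from the original one.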
Keywords/Search Tags: Named Entity Recognition, Pseudo Sample Generation Technology, BiLSTM, Conditional Random Field, Self-Attention Mechanism