
Research On Chinese Named Entity Recognition With Noisy Training Data

Posted on: 2020-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Yang
Full Text: PDF
GTID: 2428330578479405
Subject: Software engineering
Abstract/Summary:
Supervised named entity recognition (NER) methods often require a large-scale annotated corpus. However, many domains suffer from unclear definitions of named entities and a lack of annotated corpora, and obtaining high-quality annotated data is time-consuming and laborious. This thesis takes Chinese named entity recognition as its task and defines entity categories and annotation guidelines for multiple domains. To improve NER performance, we apply several methods for quickly obtaining annotated data and explore new methods for using noisy annotations.

(1) Annotating Data in Multiple Domains
First, we define entity categories and basic annotation processes in the e-commerce, dialog, and news domains. Then we quickly construct six annotated corpora through crowdsourcing and distant supervision. Finally, we implement NER benchmark models based on conditional random fields and deep learning methods, and run experiments on these corpora. Data quality analysis and experimental results show that inconsistent annotations caused by crowdsourcing affect the two benchmark models to varying degrees, and that the neural model is more robust. Distant supervision produces incomplete and noisy annotations, which seriously degrade model performance. This motivates exploring how to use such data in a targeted manner.

(2) Named Entity Recognition on Crowdsourcing Annotations
We propose an approach to crowd annotation learning for Chinese NER that makes full use of the noisy sequence labels from multiple annotators. Inspired by adversarial learning, our approach uses a common Bi-LSTM and a private Bi-LSTM to represent annotator-generic and annotator-specific information, respectively. The annotator-generic information is the common knowledge about entities that the crowd can easily master, and it keeps the model from fitting the noisy annotations. Finally, we build our Chinese NE tagger on the Bi-LSTM-CRF model. Experimental results show that our system achieves better scores than strong baseline systems.

(3) Named Entity Recognition on Distantly Supervised Annotations
We propose a novel approach that partially solves two noise problems of distantly supervised annotations for NER. To handle incomplete annotation, we apply partial annotation learning to reduce the effect of characters with unknown labels. To handle noisy annotation, we design an instance selector based on reinforcement learning to distinguish positive instances among the automatically generated annotations. Experimental results show that the proposed approach effectively reduces the negative impact of noise and outperforms the comparison systems on both distantly supervised datasets.

In conclusion, this thesis studies methods for quickly constructing annotated corpora and for making use of noisy data. We have made some preliminary progress so far, and we hope that this progress will contribute to the development of named entity recognition and other tasks in natural language processing.
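
The partial annotation learning mentioned in part (3) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes a linear-chain CRF in which each character carries a set of allowed labels (a single label where the distant annotation is known, every label where it is unknown), and the loss is the log-score of all label paths minus the log-score of the paths consistent with those sets. The function names, the score layout (`emissions[t][y]`, `transitions[prev][y]`), and the numbers in the usage lines are all hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def constrained_log_z(emissions, transitions, allowed):
    """Log of the summed scores of all label paths that respect `allowed`.

    emissions[t][y]   : score of label y at position t (hypothetical layout)
    transitions[p][y] : score of moving from label p to label y
    allowed[t]        : set of labels permitted at position t
    """
    n_labels = len(transitions)
    neg_inf = float("-inf")
    # Forward algorithm; disallowed labels get -inf so they carry no mass.
    alpha = [
        emissions[0][y] if y in allowed[0] else neg_inf
        for y in range(n_labels)
    ]
    for t in range(1, len(emissions)):
        alpha = [
            (logsumexp([alpha[p] + transitions[p][y] for p in range(n_labels)])
             + emissions[t][y])
            if y in allowed[t] else neg_inf
            for y in range(n_labels)
        ]
    return logsumexp(alpha)


def partial_crf_nll(emissions, transitions, allowed):
    """Negative log marginal likelihood of the partially labeled sequence."""
    every_label = set(range(len(transitions)))
    unconstrained = [every_label] * len(emissions)
    return (constrained_log_z(emissions, transitions, unconstrained)
            - constrained_log_z(emissions, transitions, allowed))


# Usage with made-up scores for three characters and three labels.
emissions = [[1.0, 0.2, -0.5], [0.1, 0.9, 0.3], [0.0, -0.2, 1.1]]
transitions = [[0.5, -0.1, 0.0], [0.2, 0.3, -0.4], [-0.3, 0.1, 0.6]]
# The middle character has an unknown label, so every label is allowed there.
loss = partial_crf_nll(emissions, transitions, [{0}, {0, 1, 2}, {2}])
```

When every position has exactly one allowed label, this reduces to the ordinary CRF negative log-likelihood. When every label is allowed everywhere, the loss is zero, so positions with unknown labels contribute no penalty of their own.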
Keywords/Search Tags: Named Entity Recognition, Crowdsourcing, Distant Supervision, Adversarial Learning, Reinforcement Learning