Research On Tibetan Named Entity Recognition Model Based On Active Learning

Posted on: 2019-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: F F Liu
GTID: 2438330551460571
Subject: Computer Science and Technology
Abstract/Summary:
Named entity recognition is a basic task of information extraction. Supervised learning methods based on large-scale, high-quality tagged corpora have been widely used for named entity recognition in English, Chinese, and other languages. For low-resource languages such as Tibetan, acquiring large-scale, high-quality tagged corpora costs considerable time, manpower, and money.

The Tibetan named entity recognition method based on active learning proposed in this paper works as follows. First, a small portion of the training corpus is manually annotated. Then a CRF named entity recognition model is trained on this manually annotated corpus. The model is used to annotate the unlabelled part of the training corpus, and the most informative samples are selected for manual annotation. Once the annotation is complete, the datasets are updated with the newly labelled samples. This process is repeated until the stopping criterion is reached. Finally, a Tibetan named entity recognition model based on active learning is trained on all the labelled data.

The CRF-based named entity recognition model is the basis of the active learning method in this paper. To improve the effectiveness of Tibetan named entity recognition, especially for transliterated person names, this paper proposes a Tibetan named entity recognition method based on multiple features. The experimental results show that Tibetan person name recognition reaches an F1 measure of 95.5% (96% for Tibetan names, 94.1% for transliterated names). The F1 measures for Tibetan location name recognition and Tibetan organization name recognition are 87.9% and 91.1%, respectively. This model can be used as the base model for Tibetan named entity recognition based on active learning.

This paper uses two types of active learning methods. The first is active learning based on confidence selection. This method relies on the confidence score that the CRF model assigns to candidate samples, and we designed two variants: k-confidence selection and confidence threshold. The former selects the k sentences with the lowest confidence for manual annotation, and judges whether the stopping criterion is met from the difference between the annotation results of the old and new models. The latter selects, in each iteration, all sentences whose confidence is below the threshold for manual annotation, and iteration stops when the confidence of every sentence in the corpus to be annotated is above the threshold. In the experiment, 249 Tibetan sentences were used as the initial annotated corpus. With k = 30 and a threshold of 0.01% on the difference between models, a total of 939 Tibetan sentences were manually annotated, and the F1 value of Tibetan named entity recognition reached 85.07%. Weighing labelling quality against labelling scale, we set the confidence threshold to 0.5; 2194 Tibetan sentences were manually annotated, and the F1 value of Tibetan named entity recognition reached 86.90%. Tibetan named entity recognition based on confidence-driven active learning can significantly reduce the cost of tagging a corpus.

The second is active learning based on diversity sampling. This method relies on the diversity between the Tibetan named entity features of the candidate sentences and those of the labelled corpus. In each iteration, the k sentences whose Tibetan named entity features are most diverse are manually labelled. The stopping criterion is that the diversity between every candidate sentence and the labelled corpus is 0, which indicates that all Tibetan named entity features in the corpus to be annotated already exist in the labelled corpus. In the same experimental setting, k was set to 30. The stopping criterion was reached after 16 iterations, by which point 710 Tibetan sentences had been manually annotated and the F1 value of Tibetan named entity recognition reached 83.05%. Tibetan named entity recognition based on diversity-sampling active learning can likewise significantly reduce the cost of tagging a corpus.
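
The three sample-selection rules described in the abstract (k lowest confidence, confidence threshold, and k most diverse) can be sketched in Python as follows. This is a minimal illustration, not the thesis's implementation. The function names are hypothetical, each sentence's CRF confidence is assumed to be available as a plain float, and diversity is assumed to be the number of named entity features of a sentence that do not yet occur in the labelled corpus (the abstract does not give the exact measure).

```python
from typing import List, Sequence, Set


def select_k_lowest(confidences: Sequence[float], k: int) -> List[int]:
    """Return indices of the k sentences with the lowest confidence."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:k]


def select_below_threshold(confidences: Sequence[float],
                           threshold: float) -> List[int]:
    """Return indices of all sentences whose confidence is below threshold.

    An empty result corresponds to the stopping condition: no remaining
    sentence has a confidence lower than the threshold.
    """
    return [i for i, c in enumerate(confidences) if c < threshold]


def select_k_most_diverse(pool_features: Sequence[Set[str]],
                          labelled_features: Set[str],
                          k: int) -> List[int]:
    """Return indices of up to k sentences with the most unseen features.

    Assumption: diversity = number of features in the sentence that are
    absent from the labelled corpus. Sentences with diversity 0 are never
    selected, so an empty result means every feature in the pool already
    exists in the labelled corpus (the stopping condition).
    """
    scored = []
    for i, feats in enumerate(pool_features):
        unseen = len(feats - labelled_features)
        if unseen > 0:
            scored.append((unseen, i))
    # Highest diversity first; ties broken by original sentence order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]
```

In an active learning loop, one of these functions would be called once per iteration on the unlabelled pool, the returned sentences would be annotated by hand and moved to the labelled set, and the CRF model would be retrained before the next round.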
Keywords/Search Tags: Tibetan Named Entity Recognition, Tibetan Named Entity Feature, Active Learning based on Confidence Sampling, Active Learning based on Diversity Sampling