
Research On Tibetan Named Entity Recognition Based On Weakly Supervised Learning

Posted on: 2021-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: P Sun
Full Text: PDF
GTID: 2438330602498432
Subject: Computer Science and Technology
Abstract/Summary:
Named Entity Recognition (NER) is one of the basic and key tasks of Tibetan information processing. Tibetan NER finds and classifies named entities in Tibetan text, and its results affect the performance of downstream tasks such as Tibetan information extraction and information retrieval. Currently, Tibetan NER is mainly based on supervised statistical machine learning methods. Traditional feature engineering relies on the knowledge and experience of linguists to extract shallow statistical characteristics of named entities, and it has difficulty representing their semantic information. At the same time, enlarging the training corpus is expensive because the corpus must be labeled manually. It is therefore of great research value to build a high-performance Tibetan NER model from a small-scale labeled corpus. This thesis studies Tibetan NER based on weakly supervised learning. The main work is as follows.

First, distributed representations of words are learned from unlabeled text, and word representation features built from them are added to a statistical machine learning model for Tibetan name recognition to capture semantic information and improve the model's performance. Four kinds of word representation features are studied: the word embedding feature, the binarized word embedding feature, the word embedding clustering feature, and the Brown clustering feature. A weakly supervised Tibetan name recognition model is then built on a Conditional Random Fields (CRF) model. Because related research reports that the word embedding feature and the binarized word embedding feature fail in some NER systems, a novel sampling strategy for word representation features is proposed. Experiments show that the word representation features effectively represent the semantic information of named entities, raising the F1 score of the supervised statistical model from 88.66% to 91.90%. Sampling the word representation features makes better use of the word embedding feature and the binarized word embedding feature, and reduces the model's training time by about 90% and 50%, respectively.

Second, active learning and self-learning are combined into a weakly supervised Tibetan NER model that learns from an unlabeled corpus and a small-scale labeled corpus, in order to reduce the cost of labeling. Three active learning sampling strategies are studied (Least Confidence, Maximum Normalized Log-probability, and Content Similarity), and a Tibetan NER model based on active learning is implemented. A confidence-based self-learning sampling strategy is then integrated into the active learning model, giving a weakly supervised Tibetan NER model that combines active learning and self-learning. Experiments show that, compared with a supervised statistical machine learning model for Tibetan NER and without loss of that model's performance, the active learning method reduces the amount of labeled training data by about 74%, and the combination of active learning and self-learning reduces it by about 77%. The combination therefore lowers the cost of labeling training corpora and has a certain advantage over active learning alone.
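The abstract does not give the selection formulas, so the following is only a minimal sketch of how Least Confidence, Maximum Normalized Log-probability, and a confidence-based self-learning step are commonly formulated. It is not the thesis's implementation. The names Scored, least_confidence, mnlp, and select, the example probabilities, and the 0.95 threshold are all hypothetical; best_path_prob stands for the probability a tagger such as a CRF assigns to its best label sequence and is assumed to be greater than zero.

```python
import math
from typing import List, NamedTuple, Tuple


class Scored(NamedTuple):
    """One unlabeled sentence scored by the current tagger (hypothetical structure)."""
    sentence_id: int
    best_path_prob: float  # P(y* | x) of the best label sequence, assumed > 0
    length: int            # number of tokens in the sentence


def least_confidence(s: Scored) -> float:
    """Uncertainty 1 - P(y* | x); a larger value means the model is less sure."""
    return 1.0 - s.best_path_prob


def mnlp(s: Scored) -> float:
    """Length-normalized log-probability of the best path; smaller means less sure."""
    return math.log(s.best_path_prob) / s.length


def select(pool: List[Scored], n_query: int,
           self_threshold: float) -> Tuple[List[int], List[int]]:
    """Split a pool into sentences for human labeling and for self-labeling."""
    # Active learning: send the n_query most uncertain sentences to the annotator.
    ranked = sorted(pool, key=least_confidence, reverse=True)
    to_annotate = [s.sentence_id for s in ranked[:n_query]]
    # Self-learning: keep the model's own labels only where it is very confident.
    to_self_label = [s.sentence_id for s in ranked[n_query:]
                     if s.best_path_prob >= self_threshold]
    return to_annotate, to_self_label


# Made-up scores, purely for illustration.
pool = [Scored(0, 0.98, 12), Scored(1, 0.41, 9), Scored(2, 0.75, 20),
        Scored(3, 0.99, 7), Scored(4, 0.55, 15)]
to_annotate, to_self_label = select(pool, n_query=2, self_threshold=0.95)
print(to_annotate, to_self_label)
```

In each round the sentences in to_annotate would be labeled by hand, those in to_self_label would be added with the model's own labels, and the tagger would be retrained on the enlarged set. Sentences that fall in neither group stay in the unlabeled pool.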
Keywords/Search Tags: Tibetan Named Entity Recognition, Weakly Supervised Learning, Word Representation Feature, Combining Active Learning and Self-learning