Research On The Common Method Of Named Entity Recognition In Specific Domains

Posted on: 2019-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2348330542475000
Subject: Computer Science and Technology
Abstract/Summary:
Named entity recognition (NER) is one of the basic tasks of natural language processing (NLP). It identifies proper names and other named entities in text and is widely used in tasks such as information extraction, machine translation and information retrieval. NER has achieved good results in many domains, but each recognition method is designed around the characteristics of its own domain and does not adapt to others. Through investigation and analysis, this paper combines a conditional random field (CRF), a self-learning (self-training) algorithm and an active learning algorithm to build a common NER method for specific domains, one that is suitable for most of them.

Two difficulties stand in the way of such a common method. First, when a CRF is applied to NER in a specific domain, the features selected according to the domain's characteristics are domain-dependent, and the person selecting them needs extensive domain knowledge. Second, it is difficult to obtain a large-scale annotated corpus of text from the specific domain. To address these two difficulties, this paper completes the following work.

We use Word Embedding similarity features to train the CRF. First, Word2vec is used to train the Word Embedding. Inspecting the embeddings themselves verifies that they carry rich semantic and domain information, and that embeddings trained on different corpora or with different dimensions differ. Then, the universal statistical features present in any specific domain are combined with the Word Embedding similarity features, and the minimal complete feature set is selected with an incremental learning strategy to train the CRF, so that the model can adapt to most specific domains. This paper verifies the method in the transportation domain, and the experimental results show that the Word Embedding similarity features improve recognition. However, because there are few labeled samples, recognition is still not satisfactory.

Starting from the CRF with Word Embedding similarity features, the self-learning algorithm and the active learning algorithm are combined to continue training the model. During the iterative process, active learning selects low-confidence samples for manual annotation. This overcomes two problems of self-learning: that it selects too many samples which add little beyond the original training samples, and that annotation errors accumulate because of errors in the initial classifier. At the same time, self-learning selects high-confidence samples for self-labeling, which overcomes the problem that active learning cannot make effective use of information-rich samples. The experimental results show that iterative training combining the two methods improves recognition more effectively than training with either method alone. Experiments that vary one factor at a time further show that the choice of confidence thresholds affects both model performance and the amount of manual annotation.
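The sample-routing step of the combined training loop can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not say how confidence is computed or which thresholds were used, so the `predict` callback, the `low` and `high` parameters, and the function name `route_by_confidence` are all assumptions made for this sketch. Low-confidence sentences go to manual annotation (active learning), high-confidence ones keep the model's own labels (self-learning), and the rest stay in the pool for a later round.

```python
from typing import Callable, List, Sequence, Tuple

Sentence = Sequence[str]
Labels = List[str]
# Hypothetical interface: returns (predicted labels, confidence in [0, 1]).
# With a CRF this could be the probability of the best label sequence,
# but the abstract does not specify the measure.
Predictor = Callable[[Sentence], Tuple[Labels, float]]


def route_by_confidence(
    pool: Sequence[Sentence],
    predict: Predictor,
    low: float,
    high: float,
) -> Tuple[List[Sentence], List[Tuple[Sentence, Labels]], List[Sentence]]:
    """Split an unlabeled pool into three groups for one training round."""
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("expected 0 <= low < high <= 1")

    to_annotate: List[Sentence] = []                 # active learning: ask a human
    self_labeled: List[Tuple[Sentence, Labels]] = [] # self-learning: trust the model
    deferred: List[Sentence] = []                    # neither; retry next round

    for sentence in pool:
        labels, confidence = predict(sentence)
        if confidence < low:
            to_annotate.append(sentence)
        elif confidence >= high:
            self_labeled.append((sentence, labels))
        else:
            deferred.append(sentence)
    return to_annotate, self_labeled, deferred
```

In each round, the manually annotated and self-labeled sentences would be added to the training set and the model retrained before routing the deferred pool again. Raising `low` sends more sentences to annotators, while lowering `high` admits more self-labeled data and with it more risk of accumulated labeling errors, which is the trade-off the threshold experiments examine.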
Keywords/Search Tags: NER, specific domains, CRF, Word Embedding, self-training algorithms, active learning algorithm