
Research on Chinese and English Text Entity Recognition Technology Based on Pre-trained Language Models

Posted on: 2021-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Zhang
Full Text: PDF
GTID: 2518306230991909
Subject: Computer application technology
Abstract/Summary:
Named entity recognition refers to identifying entities with specific meanings in text. In natural language processing tasks such as information extraction, question answering, syntactic analysis, and machine translation, the accuracy of named entity recognition affects the final quality of these tasks. At present, the mainstream approaches to named entity recognition are statistical; such methods usually require manual feature engineering, which consumes substantial time and effort. There are also methods based on deep neural networks, such as convolutional neural networks, recurrent neural networks, and their variant long short-term memory (LSTM) networks, but these are often limited by the scale of the training data and cannot handle named entity recognition in complex text well. For open-domain named entities that are expressed irregularly and lack training corpora, entity boundaries are often judged inaccurately. In view of these problems, this thesis studies named entity recognition technology based on pre-trained language models. The main content includes the following two aspects.

(1) To address the particularity of definition entities, the varying lengths of entities, and the difficulty of determining entity boundaries, this thesis proposes a sequence labeling method based on a pre-trained language model. Using the rich semantic vector representations of BERT, the model applies a conditional random field (CRF) to learn the dependencies between output tags, thereby capturing the contextual dependencies of the output sequence. This resolves the insufficient semantic information of traditional word-embedding methods and strengthens the coherence of the output sequence through the CRF. Compared with other models, this model achieves the best results, demonstrating that it is suitable for extracting definitions from the free text of English textbooks.

(2) In view of the complex structure of medical text, the particularity of Chinese text, and the fact that the same word may belong to different entity categories in different contexts, this thesis proposes a model combining bidirectional long short-term memory (BiLSTM) with a conditional random field. Trained on the cnmer2017 and cnmer2018 corpora, the proposed model improves entity boundary detection and context-based entity category judgment, and mitigates the sparsity of the training data. In comparative experiments, the model outperforms nine baseline models, demonstrating that it is suitable for the Chinese medical named entity recognition task.
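Both methods cast entity recognition as sequence labeling over BIO tags. A minimal sketch of that formulation, using hypothetical tokens and a hypothetical entity span (no model involved, pure Python):

```python
def spans_to_bio(tokens, spans):
    """Convert token-level entity spans to BIO tags.

    spans: list of (start, end_exclusive, label) over token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

tokens = ["A", "stack", "is", "a", "LIFO", "data", "structure"]
# one hypothetical definition-term entity covering tokens 4..6
print(spans_to_bio(tokens, [(4, 7, "TERM")]))
# → ['O', 'O', 'O', 'O', 'B-TERM', 'I-TERM', 'I-TERM']
```

The B-/I- distinction is what lets the model mark entity boundaries explicitly, which is the boundary-judgment problem both chapters target.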
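The CRF layer's contribution, learning dependencies between output tags, shows up at inference as Viterbi decoding: it combines per-token scores from the encoder (BERT or BiLSTM) with learned tag-transition scores, so that sequences like O followed by I- are ruled out. A minimal sketch with hand-set scores standing in for learned ones (all values here are illustrative assumptions, not the thesis's parameters):

```python
def viterbi(emissions, transitions, tags):
    """Best tag sequence under per-token emission scores plus pairwise
    transition scores -- the decoding step a CRF layer performs."""
    best = {t: emissions[0][t] for t in tags}  # best score for a path ending in tag t
    backpointers = []
    for scores in emissions[1:]:
        new_best, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[(p, t)])
            new_best[t] = best[prev] + transitions[(prev, t)] + scores[t]
            ptr[t] = prev
        best = new_best
        backpointers.append(ptr)
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for ptr in reversed(backpointers):         # walk backpointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]

tags = ["O", "B", "I"]
# all transitions allowed except O -> I, which a trained CRF learns to forbid
transitions = {(p, t): 0.0 for p in tags for t in tags}
transitions[("O", "I")] = -100.0

# hypothetical per-token scores; in a real system these come from BERT or a BiLSTM
emissions = [
    {"O": 0.0, "B": 1.0, "I": 0.5},
    {"O": 0.0, "B": 0.0, "I": 2.0},
    {"O": 1.0, "B": 0.0, "I": 0.0},
]
print(viterbi(emissions, transitions, tags))
# → ['B', 'I', 'O']
```

Even though the middle token's emission favors I, the decoder only emits I after a compatible B, which is how the CRF "deepens the connection between output sequences" relative to tagging each token independently.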
Keywords/Search Tags: Pre-trained language model, Long short-term memory, Bidirectional Encoder Representations from Transformers (BERT), Named entity recognition