Research On Virus Named Entity Recognition Methods Based On Language And Distantly Supervised Model

Posted on: 2022-06-13  Degree: Master  Type: Thesis
Country: China  Candidate: H L Mu
GTID: 2480306350453234  Subject: Computer application technology
Abstract/Summary:
As high-throughput sequencing reveals more of the microbial world, the prediction of virus-host relationships has attracted growing attention. The existing medical literature already contains many virus-host relationships that have been confirmed experimentally, and text mining can uncover these relationships. The process consists of named entity recognition, entity alignment and relation extraction. A knowledge base of virus-host relationships built in this way helps scholars verify potential virus-host relationships through reasoning and prediction over the knowledge base. Virus named entity recognition is the prerequisite for, and the key to, extracting virus-host relationships from biomedical texts. The diversity of virus names, the emergence of new entities and the nesting of entities make the task challenging. This thesis proposes two virus named entity recognition methods, one based on language models and one based on distant supervision. The main work is as follows.

Firstly, a virus named entity recognition method based on language models is proposed. The same word can carry different meanings in different contexts, yet traditional one-hot encoding and word vectors trained on general-domain text perform poorly when applied to models for the microbial domain. To address this, the thesis selects three mainstream language models (Word2Vec, ELMo and BERT), trains them on a large unlabeled microbial corpus to obtain contextual representations, then uses a BiLSTM for feature extraction and a CRF for tag prediction. The experimental results show that the BERT language model performs best on the virus named entity recognition task.

Secondly, a virus named entity recognition method based on distant supervision is proposed. Supervised learning requires a large manually annotated corpus, a requirement that distant supervision can relieve. Distant supervision automatically labels target entities in text according to a third-party dictionary, but it is prone to labeling errors and data noise, which lower the quality of the labeled data and degrade model performance. Building on the results above, the thesis draws on the ideas of the multi-layer perceptron and reinforcement learning and proposes a method that combines BiLSTM-CRF with reinforcement learning to address these two problems. Experiments show that the proposed method effectively reduces the mislabeled data introduced by distant supervision and improves model performance.

Finally, the best-performing language-model-based recognizer is applied to a large body of medical literature, and its predictions are compared against a virus knowledge base. It finds many meaningful virus entities that do not appear in the knowledge base, which demonstrates the research value of virus named entity recognition and lays a foundation for research on virus-host relation extraction.
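The dictionary-based labeling step that distant supervision relies on can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `distant_label`, the `B-VIRUS`/`I-VIRUS` tag names, the greedy longest-match strategy and the toy dictionary are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def distant_label(tokens: List[str],
                  dictionary: Set[Tuple[str, ...]],
                  max_len: int = 5) -> List[str]:
    """Assign BIO tags by greedy longest match against a name dictionary.

    Hypothetical helper: dictionary entries are lowercased token tuples.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            tags[i] = "B-VIRUS"
            for j in range(i + 1, i + matched):
                tags[j] = "I-VIRUS"
            i += matched
        else:
            i += 1
    return tags


# Toy dictionary (assumed entries, not taken from the thesis).
VIRUS_DICT = {("zika", "virus"), ("influenza", "a", "virus")}

clean = distant_label("Zika virus infects human cells".split(), VIRUS_DICT)
# "HIV" is missing from the toy dictionary, so it is silently tagged "O":
noisy = distant_label("HIV and Zika virus were detected".split(), VIRUS_DICT)
```

In the second call the dictionary gap leaves a real virus mention unlabeled, which is the kind of labeling noise the abstract says a downstream model has to cope with.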
Keywords/Search Tags:Text Mining, Virus Named Entity Recognition, Language Model, Distant Supervision