Named Entities Recognition And Normalization In Biomedical Literatures

Posted on: 2014-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: W T Fan
GTID: 2248330398950478
Subject: Computer application technology
Abstract/Summary:
Biomedical Named Entity Recognition (Bio-NER) and Gene Normalization (GN) are critical steps in biomedical text mining and are currently among the NLP (Natural Language Processing) research problems attracting international attention. Only when bio-entities are correctly identified and normalized can more complex tasks, such as protein-protein interaction extraction, text classification, and implicit knowledge discovery, be carried out effectively.

The contributions of this dissertation are as follows:

(1) This dissertation presents a two-phase Bio-NER model, based on a two-layer stacking method and a multi-agent strategy, targeted at the JNLPBA 2004 task. The two-phase method divides the task into two subtasks: Named Entity Detection (NED) and Named Entity Classification (NEC). In the first phase, the NED subtask distinguishes named entities (NEs) from non-named-entities (NNEs) in the biomedical literature without identifying their types. For this subtask, six classifiers are built with four toolkits (CRF++, YamCha, Maximum Entropy, Mallet) using different training methods, and their outputs are integrated by the two-layer stacking method. In the second phase, the NEC subtask applies the multi-agent strategy to determine the correct entity type for each entity identified in the first phase. Experimental results show that the presented approach achieves an F-score of 76.06%, which outperforms most state-of-the-art systems.

(2) This dissertation presents a multistage gene normalization system targeted at the BioCreative II GN task. The system consists of four major subtasks: pre-processing, dictionary matching, ambiguity resolution, and filtering. For pre-processing, we apply the gene mention tagger developed in our earlier work, which achieves an F-score of 88.42% on the BioCreative II GM test set. In the dictionary matching stage, exact matching and approximate matching between gene names and the EntrezGene lexicon are combined. For ambiguity resolution, we propose a semantic similarity disambiguation method based on the Hungarian algorithm. In the last step, a filter based on Wikipedia removes false positives that represent gene family names rather than specific gene names. Experimental results show that the presented system achieves an F-score of 90.1%, which outperforms most state-of-the-art systems.

The approaches to named entity recognition and normalization in biomedical literature presented in this dissertation are effective, and they can be applied to other areas of biomedical text mining.
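As a rough illustration of the two-layer stacking idea used for the NED subtask, the sketch below trains a second-layer learner on the outputs of several first-layer taggers. It is not the dissertation's system: the three rule-based taggers are hypothetical stand-ins for the six CRF++, YamCha, Maximum Entropy, and Mallet classifiers, the second layer is a simple lookup table, and the tokens and labels are invented toy data. A real stacking setup would also train the second layer on held-out first-layer predictions rather than on the same training tokens.

```python
from collections import Counter, defaultdict

# Hypothetical first-layer taggers (stand-ins for real trained classifiers).
def digit_tagger(token):
    return "NE" if any(ch.isdigit() for ch in token) else "O"

def capital_tagger(token):
    return "NE" if token[0].isupper() else "O"

def suffix_tagger(token):
    return "NE" if token.endswith(("ase", "in")) else "O"

BASE_TAGGERS = [digit_tagger, capital_tagger, suffix_tagger]

def train_meta(base_taggers, tokens, gold_labels):
    """Second layer: map each tuple of first-layer votes to its most frequent gold label."""
    table = defaultdict(Counter)
    for token, label in zip(tokens, gold_labels):
        votes = tuple(tagger(token) for tagger in base_taggers)
        table[votes][label] += 1
    return {votes: counts.most_common(1)[0][0] for votes, counts in table.items()}

def predict(base_taggers, meta, token):
    """Label a token with the second-layer model; fall back to majority vote for unseen vote patterns."""
    votes = tuple(tagger(token) for tagger in base_taggers)
    if votes in meta:
        return meta[votes]
    return Counter(votes).most_common(1)[0][0]

# Invented toy training data: NE = named entity, O = non-named-entity.
train_tokens = ["IL-2", "the", "kinase", "Binding", "p53", "of"]
train_labels = ["NE", "O", "NE", "O", "NE", "O"]

meta_model = train_meta(BASE_TAGGERS, train_tokens, train_labels)
label_protease = predict(BASE_TAGGERS, meta_model, "protease")
label_with = predict(BASE_TAGGERS, meta_model, "with")
```

In this toy setting only the suffix tagger fires on "protease", so a plain majority vote would reject it, while the second layer has learned from "kinase" that this vote pattern usually marks a named entity.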
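The ambiguity resolution step can be pictured as an assignment problem: each ambiguous gene mention is paired with one candidate identifier so that the total semantic similarity is as high as possible. The sketch below is a minimal, standard-library implementation of the Hungarian (Kuhn-Munkres) algorithm applied to such a problem. How the dissertation actually builds its similarity scores and assignment matrix is not stated in the abstract, so the function names and the similarity values here are assumptions made purely for illustration.

```python
def hungarian_min_cost(cost):
    """Hungarian algorithm for a rows x cols cost matrix with rows <= cols.

    Returns a list giving, for each row, the index of its assigned column.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment

def disambiguate(similarity):
    """Maximize total similarity by minimizing the negated scores."""
    cost = [[-score for score in row] for row in similarity]
    return hungarian_min_cost(cost)

# Invented similarity scores: rows are gene mentions, columns are candidate gene identifiers.
similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
]
assignment = disambiguate(similarity)
total = sum(similarity[row][col] for row, col in enumerate(assignment))
```

Choosing the best candidate for each mention independently would give both mentions the first identifier. The assignment formulation instead gives the first mention the second identifier and the second mention the first, which is the pairing with the highest total similarity when no identifier may be used twice.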
Keywords/Search Tags: Biomedical Named Entity Recognition and normalization, two-layer stacking method, multi-agent strategy, Hungarian algorithm