
Recognizing Named Entities In Biomedical Literatures

Posted on: 2010-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: R P Zhou
GTID: 2178360302460328
Subject: Computer application technology
Abstract/Summary:
Biomedical Named Entity Recognition (Bio-NER) is a critical step in biomedical text mining: only when bio-entities are correctly identified can more complex tasks, such as gene/protein normalization and protein-protein interaction extraction, be performed effectively. However, due to the irregularities and ambiguities in bio-entity nomenclature, Bio-NER remains a challenging task. This thesis focuses on recognizing named entities in English biomedical literature. Both the JNLPBA2004 and BioCreAtIvE 2 GM datasets are used in the experiments. The contributions of this thesis can be summarized as follows:

(1) This thesis presents a two-phase Bio-NER approach based on Conditional Random Fields (CRF), which divides the JNLPBA2004 shared task into two subtasks: Named Entity Detection (NED) and Named Entity Classification (NEC). The two subtasks are completed in two phases. In the first phase (the NED subtask), a CRF model distinguishes named entities in biomedical literature from non-entities, without identifying their types. In the second phase (the NEC subtask), another CRF model determines the correct entity type for each identified entity. To achieve better performance, four post-processing algorithms are applied before the NEC subtask. Experimental results show that the presented approach both reduces training cost and improves performance. It achieves an F1-measure of 74.47% on the JNLPBA2004 dataset, which is 1.92 percentage points higher than the top system in the JNLPBA2004 challenge.

(2) To address the BioCreAtIvE 2 GM task, this thesis presents a Bio-NER approach in which divergent models are implemented and integrated. In the experiments, six divergent models are implemented with different machine learning algorithms and dissimilar feature sets, and their results are integrated by two strategies: simple set operations (intersection and union) and voting. Experimental results show that integrating divergent models improves tagging performance. The presented approach achieves an F1-measure of 87.89% on the BioCreAtIvE 2 GM dataset, which is 0.68 percentage points higher than the top system in the BioCreAtIvE 2 challenge.
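
The integration strategies named in contribution (2) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the representation of each model's output as a set of (start, end) spans, the function names, and the vote threshold are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: each model's output is a set of (start, end) character spans
# marking gene mentions in one sentence. The thesis does not specify this
# representation; it is chosen here only to keep the sketch small.

def integrate_intersection(predictions):
    """Keep only the spans that every model predicted (favours precision)."""
    if not predictions:
        return set()
    return set.intersection(*predictions)

def integrate_union(predictions):
    """Keep every span that any model predicted (favours recall)."""
    if not predictions:
        return set()
    return set.union(*predictions)

def integrate_voting(predictions, min_votes):
    """Keep spans predicted by at least min_votes models (hypothetical threshold)."""
    votes = Counter(span for spans in predictions for span in spans)
    return {span for span, count in votes.items() if count >= min_votes}

# Made-up outputs of three hypothetical models for one sentence.
model_a = {(0, 5), (10, 18)}
model_b = {(0, 5), (20, 27)}
model_c = {(0, 5), (10, 18), (30, 34)}
outputs = [model_a, model_b, model_c]

print(sorted(integrate_intersection(outputs)))
print(sorted(integrate_union(outputs)))
print(sorted(integrate_voting(outputs, min_votes=2)))
```

Intersection and union are the two extremes of voting: with n models, a threshold of n reproduces the intersection and a threshold of 1 reproduces the union, so the threshold trades precision against recall.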
Keywords/Search Tags: Text Mining, Named Entity Recognition, Biomedical Named Entity Recognition, Machine Learning