The Research Of Biomedical Name Entity Recognition By Combining Dictionary Based And Machine Learning Based Method

Posted on: 2010-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wang
Full Text: PDF
GTID: 2178360302460349
Subject: Computer software and theory
Abstract/Summary:
Biomedical named entity recognition (Bio-NER) is the task of recognizing specialized terminology in molecular biology and medicine. This terminology covers biomedical named entities as well as the locations where they act, for example proteins, DNA, RNA, and cell lines. A large body of biomedical literature is now available for knowledge mining. To discover links among biomedical entities, genes, proteins, and other entities must first be identified in the literature. Bio-NER is therefore the basis for other text mining technologies such as relation extraction, hypothesis generation, and text classification.

Current research on Bio-NER follows three kinds of methods: dictionary-based, rule-based, and statistical machine learning. The dictionary-based approach is relatively simple and practical, but its performance is limited by the size and quality of the dictionaries. The rule-based method depends on the completeness and soundness of its rules and adapts poorly. The statistical machine learning method trains on a manually annotated corpus to produce a model, which is then used to label unannotated text. Its advantage is robustness, and it is widely used.

No lexicon can cover all biomedical entities, and new entities keep emerging. To make up for the shortcomings of the dictionary-based method while keeping the advantages of statistical machine learning, this thesis proposes a new combination of the two. First, we download dictionary information about biomedical named entities from relevant biomedical websites and combine it with a Conditional Random Fields (CRFs) model to assign Part Of Speech-Entity (POS-Entity) tags to the corpus. We then adopt a distributed strategy that divides the entities into groups and trains a separate tagging model for each group. In addition, we select more effective features based on the characteristics of biomedical named entities and use the CRFs model to carry out the recognition task.

Experimental results show the effect of combining the dictionary-based and machine learning based approaches. On the JNLPBA2004 corpus, adding POS-Entity tags to the CRFs model after applying the distributed strategy yields an F-score of 72.83% without any post-processing. Post-processing raises the F-score further to 73.39%.
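
The dictionary side of such a combination can be illustrated with a short sketch. It is not the thesis's implementation. The toy dictionary entries, the greedy longest-match lookup, the function names, and the way a POS tag and a dictionary tag are joined into one "POS-Entity" feature are all assumptions made for illustration. The abstract does not specify how POS-Entity tags are built. The resulting per-token feature dictionaries are the kind of input a CRF toolkit would typically accept.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: lowercased token tuples mapped to an entity class.
# The entries are illustrative only and are not taken from the thesis.
ENTITY_DICT: Dict[Tuple[str, ...], str] = {
    ("il-2",): "protein",
    ("nf-kappa", "b"): "protein",
    ("t", "cells"): "cell_type",
}
MAX_LEN = max(len(key) for key in ENTITY_DICT)


def dictionary_tags(tokens: List[str]) -> List[str]:
    """Greedy longest-match dictionary lookup; returns one B-/I-/O tag per token."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = ENTITY_DICT.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


def token_features(tokens: List[str], pos_tags: List[str]) -> List[Dict[str, str]]:
    """Build per-token features; 'pos_entity' joins POS and dictionary tag (assumed format)."""
    dict_tags = dictionary_tags(tokens)
    return [
        {"word": w.lower(), "pos": p, "dict": d, "pos_entity": p + "|" + d}
        for w, p, d in zip(tokens, pos_tags, dict_tags)
    ]


if __name__ == "__main__":
    sentence = ["IL-2", "activates", "NF-kappa", "B", "in", "T", "cells"]
    pos = ["NN", "VBZ", "NN", "NN", "IN", "NN", "NNS"]  # hand-assigned POS tags
    for feats in token_features(sentence, pos):
        print(feats)
```

A lookup of this kind only finds entities already in the dictionary. That limitation is why the abstract pairs it with a statistical model, which can still label unseen entities from the other features.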
Keywords/Search Tags:Biomedical Name Entity Recognition, Distributed Strategy, Features, Entity Dictionary, Conditional Random Fields