Research On Disease Name Recognition And Disease Normalization In Biomedical Literature

Posted on: 2016-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yang
Full Text: PDF
GTID: 2308330461976512
Subject: Computer application technology
Abstract/Summary:
Disease is a major threat to human health, and the more that is known about a disease, the better it can be prevented. However, searching the large and growing body of biomedical literature for information on a disease of interest is time-consuming. Automatic identification of disease names would reduce this effort, but it remains a challenging task in biomedical named entity recognition (NER). Disease NER is the problem of finding references to disease entities (mentions) in natural language text and tagging them with the semantic type "disease". Disease names naturally exhibit considerable variation (e.g. synonyms), which makes it difficult to retrieve the details of a particular disease, so disease name normalization is also needed. The task of disease normalization consists of finding disease mentions and assigning a unique identifier to each. It is important to many lines of inquiry involving disease, including etiology (e.g. gene-disease relationships) and clinical aspects (e.g. diagnosis, prevention and treatment).

Previous studies argue that a considerable amount of research has already been conducted on biomedical NER, especially on gene/protein name recognition, whereas disease named entity recognition has not received the same level of attention. This thesis presents an approach that combines a Conditional Random Field (CRF) model with a dictionary to recognize disease names in biomedical texts. First, a disease name dictionary is constructed from an external biomedical resource, PharmGKB. The results of looking text up in this dictionary are then introduced as a feature into the CRF model, which is used to recognize disease names in biomedical texts. Finally, contextual cues are used to pair full disease names with their abbreviations, which further improves recognition performance. Experimental results show that this approach achieves an F-measure of 83.82% on the NCBI disease corpus.
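As a rough illustration of how a dictionary lookup result can become a CRF feature, the sketch below tags each token with a B/I/O label from a greedy longest-match lookup and adds that label to a per-token feature dictionary. The function names, the toy dictionary, the matching strategy and the other features are assumptions made for illustration; the abstract does not specify how the PharmGKB dictionary is matched or which features the thesis actually uses.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, ...]


def dictionary_tags(tokens: List[str], dictionary: Set[Entry],
                    max_len: int = 5) -> List[str]:
    """Greedy longest-match lookup; returns one B/I/O tag per token."""
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def token_features(tokens: List[str],
                   dictionary: Set[Entry]) -> List[Dict[str, object]]:
    """Build per-token feature dicts, including the dictionary tag."""
    dict_tags = dictionary_tags(tokens, dictionary)
    last = len(tokens) - 1
    features = []
    for i, tok in enumerate(tokens):
        features.append({
            "word.lower": tok.lower(),
            "word.istitle": tok.istitle(),
            "suffix3": tok[-3:].lower(),
            "dict_tag": dict_tags[i],
            "prev_dict_tag": dict_tags[i - 1] if i > 0 else "BOS",
            "next_dict_tag": dict_tags[i + 1] if i < last else "EOS",
        })
    return features


# Toy dictionary standing in for one built from an external resource.
TOY_DICTIONARY: Set[Entry] = {
    ("breast", "cancer"),
    ("cystic", "fibrosis"),
    ("diabetes",),
}

tokens = "Mutations in BRCA1 are linked to breast cancer .".split()
tags = dictionary_tags(tokens, TOY_DICTIONARY)
feats = token_features(tokens, TOY_DICTIONARY)
```

A sequence of such feature dictionaries, paired with gold B/I/O labels, is the kind of input a CRF toolkit would be trained on; the dictionary tag gives the model a strong but fallible hint that it can weigh against the surrounding context.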
A disease named entity recognizer is also built to facilitate text mining; it presents the recognized named entities in visual form.

This thesis also presents a disease normalization method based on semantic resources. A problem in traditional disease name normalization is that the descriptions of disease symbols in biomedical databases are incomplete, which makes it difficult to determine the specific meaning of an ambiguous disease name. To address this, semantic information is extracted for each disease symbol from the MEDIC vocabulary and MEDLINE abstracts, and its similarity to the context of the ambiguous disease name is calculated. The disease symbol with the highest similarity score is assigned to the ambiguous disease name. The algorithm achieves a micro-averaged F-measure of 0.7970 and a macro-averaged F-measure of 0.7949.
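The disambiguation step can be sketched as scoring each candidate symbol's semantic profile against the context of the ambiguous mention and keeping the best one. The abstract does not name the similarity measure, so the bag-of-words cosine similarity used here is an assumption, as are the function names, the placeholder candidate identifiers and the toy profile texts.

```python
import math
import re
from collections import Counter
from typing import Dict, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(context: str,
                 profiles: Dict[str, str]) -> Tuple[str, Dict[str, float]]:
    """Return the candidate whose profile is most similar to the context."""
    ctx = bag_of_words(context)
    scores = {cid: cosine(ctx, bag_of_words(text))
              for cid, text in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Placeholder identifiers and profile texts (hypothetical); in the thesis
# the semantic information comes from MEDIC and MEDLINE abstracts.
profiles = {
    "CANDIDATE_1": "neurodegenerative disorder tremor dopamine neurons movement",
    "CANDIDATE_2": "bone remodeling disorder skeletal deformity fracture",
}
context = "Patients with PD showed tremor and loss of dopamine neurons."

best, scores = disambiguate(context, profiles)
```

Here the context shares "tremor", "dopamine" and "neurons" with the first profile and nothing with the second, so the first candidate is selected. Richer profiles help only insofar as they add terms that actually appear near the ambiguous mention.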
Keywords/Search Tags: Disease Mention Recognition, Disease Name Normalization, Exact Matching, Approximate Matching, Disambiguation