Affix And Character Word Level Based Method For Recognizing Biomedical Entity Names

Posted on: 2013-06-12
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Wu
GTID: 2248330371483642
Subject: Computer Science and Technology
Abstract/Summary:
With the development of molecular biology, genomics and proteomics, the volume of biomedical literature is growing at a remarkable pace, and extracting information from it quickly requires computational methods. There are no uniform naming standards for biomedical entities such as proteins and DNA. Recognizing biomedical entity names is therefore the first step in obtaining information from the literature.

Machine learning recognizes biomedical entity names with high accuracy, and it is used and improved by a growing number of researchers. Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMMs) and Conditional Random Fields (CRFs) are three machine learning methods used in text mining. This thesis discusses them in terms of their performance in part-of-speech tagging and entity name recognition. Word-formation and syntactic features are helpful for part-of-speech tagging, so prefixes and suffixes are introduced as features in HMM, MEMMs and CRFs. Some words, such as prepositions and pronouns, cannot be tagged from their prefixes and suffixes. Because these words are few, string matching is used to tag them.

In recognizing biomedical entity names, five types of biomedical entity are recognized using character words and phrase boundaries. There are two difficulties in this process. On the one hand, one entity name can belong to different types. On the other hand, one character word may occur in different types. A character word level strategy is used to distinguish entity types. Words that occur with high frequency in biomedical entity names are chosen as character words, giving five sets of character words, one each for protein, DNA, RNA, cell line and cell type. The results show that the character word level strategy makes these methods perform better than they do without it.

The GENIA Corpus of the GENIA Project is used as the primary data set in the experiments. The results of part-of-speech tagging and biomedical entity name recognition with HMM, MEMMs and CRFs are compared by precision, recall and F score. The results show that CRFs perform better than HMM and MEMMs. A comparison of CRFs with the GENIA tagger shows that the affix and character word level based method achieves better precision and F score than the GENIA tagger. In particular, CRFs perform clearly better than the GENIA tagger in identifying DNA and RNA, where the F score increases by 6.29% and 4.56%, respectively.
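
The ingredients named in the abstract (affix features, character word sets, and scoring by precision, recall and F score) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The character word lists, the function names, the affix length of 3 and the simple vote count are all assumptions made for illustration. The abstract does not list the actual character words or define how the "levels" are assigned.

```python
from collections import Counter

# Hypothetical character-word sets: the abstract does not list the real ones.
CHARACTER_WORDS = {
    "protein": {"protein", "receptor", "kinase", "factor"},
    "DNA": {"gene", "promoter", "enhancer"},
    "RNA": {"mrna", "transcript"},
    "cell_line": {"line"},
    "cell_type": {"lymphocyte", "monocyte"},
}


def affix_features(token, max_len=3):
    """Return prefix and suffix features of length 1..max_len for one token.

    max_len=3 is an assumed setting, not a value taken from the thesis.
    """
    lower = token.lower()
    feats = {}
    for n in range(1, min(max_len, len(lower)) + 1):
        feats[f"prefix{n}"] = lower[:n]
        feats[f"suffix{n}"] = lower[-n:]
    return feats


def character_word_votes(phrase_tokens):
    """Count, per entity type, how many tokens are character words of that type.

    A plain vote count is an assumed stand-in for the thesis's
    character word level strategy, which the abstract does not spell out.
    """
    votes = Counter()
    for tok in phrase_tokens:
        for entity_type, words in CHARACTER_WORDS.items():
            if tok.lower() in words:
                votes[entity_type] += 1
    return votes


def precision_recall_f1(gold, predicted):
    """Exact-match scores over sets of (start, end, type) entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

In a tagger of the kind described, the dictionary returned by `affix_features` would be one part of each token's feature set, and the counts from `character_word_votes` could help choose among the five entity types for a candidate phrase. The F score computed here is the balanced harmonic mean of precision and recall.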
Keywords/Search Tags: Bioinformatics, biomedical entity name recognition, HMM, MEMMs, CRFs