
Synonym and homonym resolution of gene and protein names

Posted on: 2003-04-15
Degree: Ph.D.
Type: Dissertation
University: Columbia University
Candidate: Yu, Hong
Full Text: PDF
GTID: 1464390011481735
Subject: Information Science
Abstract/Summary:
The current MEDLINE database includes more than 12 million computer-readable [...] could automatically identify biological knowledge stored in MEDLINE. Before applying NLP, however, one must identify gene/protein names.

I present Gene/Protein name mark up (GPmarkup), a software [...] and links the names to their synonyms (for example, Apo3, LARD, and lymphocyte associated receptor of death all represent the same protein). GPmarkup also disambiguates homonyms (i.e., two or more gene/protein names spelled alike but different in meaning) by mapping them (e.g., LARD) to their full forms (e.g., lymphocyte associated receptor of death).

GPmarkup first identifies the patterns that authors use to introduce synonymous gene/protein names and then extracts the synonyms based on those patterns. It implements a set of pattern-matching rules to map short gene/protein abbreviations to their full forms when they appear in the patterns <full form>(<abbreviation>) or <abbreviation>(<full form>).

GPmarkup applies positional function keywords (e.g., receptor and kinase) to separate gene/protein terms from other abbreviation-full form pairs. We applied GPmarkup to 11 million MEDLINE records to generate a knowledge source of paired gene/protein abbreviations and full forms. The knowledge source is then used to mark up gene/protein terms. GPmarkup has 73 percent recall and 93 percent precision in marking [...] GPmarkup recognizes gene/protein aliases (e.g., Apo3 and LARD) via patterns (e.g., solidus or comma) and applies a set of knowledge-based filters to remove non-gene/protein names. GPmarkup has an overall precision of 71% on both MEDLINE and journal articles and 90% precision on the more suitable full-text articles.

To disambiguate homonymous gene/protein terms, GPmarkup applies unsupervised machine learning methods to identify the full forms of the ambiguous abbreviations. GPmarkup has a 93 percent precision in identifying the full forms of undefined abbreviations.
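
The <full form>(<abbreviation>) pattern mentioned in the abstract can be illustrated with a small sketch. This is not GPmarkup's actual rule set, which the abstract does not spell out. It is a simplified first-letter heuristic written for illustration, and the names `PAREN`, `STOPWORDS`, `match_full_form`, and `extract_pairs` are hypothetical. It handles only the <full form>(<abbreviation>) direction.

```python
import re

# Assumption: an abbreviation is a short alphanumeric token inside parentheses.
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")

# Assumption: small words that may appear in a full form without
# contributing a letter to the abbreviation.
STOPWORDS = {"of", "the", "and", "in", "for"}


def match_full_form(abbr, words):
    """Walk backwards through the preceding words, matching each letter of
    the abbreviation to the first letter of a word. Return the full form,
    or None if the letters cannot all be matched."""
    letters = [c.lower() for c in abbr if c.isalpha()]
    i = len(words) - 1
    for letter in reversed(letters):
        # Skip stopwords that do not supply the letter we are looking for.
        while i >= 0 and words[i].lower() in STOPWORDS and words[i][0].lower() != letter:
            i -= 1
        if i < 0 or words[i][0].lower() != letter:
            return None
        i -= 1
    return " ".join(words[i + 1:])


def extract_pairs(text):
    """Return (abbreviation, full form) pairs found in the text."""
    pairs = []
    for match in PAREN.finditer(text):
        abbr = match.group(1)
        words = re.findall(r"[A-Za-z0-9-]+", text[:match.start()])
        full_form = match_full_form(abbr, words)
        if full_form is not None:
            pairs.append((abbr, full_form))
    return pairs
```

Applied to a sentence such as "The lymphocyte associated receptor of death (LARD) binds.", the sketch pairs LARD with its full form. A parenthetical such as "(Fig)" with no matching preceding words yields no pair.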
Note that the short gene/protein names are the abbreviations of their full forms. When the full forms are identified, we can apply positional function keywords to separate gene/protein terms from other abbreviation-full form pairs.
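
As a rough illustration of the keyword step, the sketch below keeps only those abbreviation-full form pairs whose full form contains a function keyword. The abstract does not describe the positional constraints GPmarkup uses, so this version checks only for the presence of a keyword as a whole word. The keyword set holds just the two examples given in the abstract, and the function names are hypothetical.

```python
# Only the two example keywords named in the abstract; the full list used
# by GPmarkup is not given there.
FUNCTION_KEYWORDS = {"receptor", "kinase"}


def has_function_keyword(full_form, keywords=FUNCTION_KEYWORDS):
    """True if any whole word of the full form is a function keyword.
    Assumption: position is ignored, unlike the 'positional' keywords
    described in the abstract."""
    words = full_form.lower().replace("-", " ").split()
    return any(word in keywords for word in words)


def filter_gene_protein_pairs(pairs, keywords=FUNCTION_KEYWORDS):
    """Keep (abbreviation, full form) pairs that look like gene/protein terms."""
    return [(abbr, full) for abbr, full in pairs if has_function_keyword(full, keywords)]


pairs = [
    ("LARD", "lymphocyte associated receptor of death"),
    ("NLP", "natural language processing"),
    ("PKA", "protein kinase A"),
]
kept = filter_gene_protein_pairs(pairs)
```

Here `kept` retains the LARD and PKA pairs and drops the NLP pair, since only the first two full forms contain a function keyword.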
Keywords/Search Tags: Full forms, Gene/protein, Names, MEDLINE, GPmarkup