An entity resolution framework for deduplicating proteins

Posted on: 2009-11-07
Degree: M.Sc
Type: Thesis
University: University of Toronto (Canada)
Candidate: Lochovsky, Lucas
Full Text: PDF
GTID: 2448390002996019
Subject: Biology
Abstract/Summary:
This thesis describes the design and implementation of a new framework, PERF, for deduplicating protein mentions using a wide range of protein attributes. A mention is any recorded information about a protein, whether derived from a database, a high-throughput study, literature text mining, or another source. The framework can readily be extended to the deduplication of protein-protein interactions (PPIs). PERF translates mentions into instances of a Framework XML schema to facilitate mention comparisons. It also uses "virtual attribute dependencies" to "enhance" mentions with additional attribute values. PERF computes a likelihood measure based on the textual similarity of mention attributes and on the overlap between the protein classes implied by each mention's non-sequence attributes. A prototype of the framework was implemented, and preliminary tests indicate that it can clearly separate duplicate mentions from non-duplicate mentions.
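The abstract does not give PERF's actual likelihood formula, so the following is only an illustrative sketch of how a score combining textual similarity and class overlap might look. Everything in it is an assumption made for illustration: the mention layout (a dict with "attributes" and "classes"), the function names, the use of Python's difflib ratio for textual similarity, the Jaccard index for class overlap, and the equal weighting of the two parts. None of these are taken from the thesis.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio, an assumed stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def class_overlap(classes_a: set, classes_b: set) -> float:
    """Jaccard overlap of two sets of protein classes; 0.0 if both are empty."""
    union = classes_a | classes_b
    if not union:
        return 0.0
    return len(classes_a & classes_b) / len(union)


def duplicate_score(m1: dict, m2: dict, text_weight: float = 0.5) -> float:
    """Hypothetical score in [0, 1]; higher suggests the mentions are duplicates.

    Averages the textual similarity over attributes present in both mentions,
    then mixes it with the class overlap using an assumed weight.
    """
    shared = [k for k in m1["attributes"] if k in m2["attributes"]]
    if shared:
        text = sum(
            text_similarity(m1["attributes"][k], m2["attributes"][k])
            for k in shared
        ) / len(shared)
    else:
        text = 0.0
    overlap = class_overlap(m1["classes"], m2["classes"])
    return text_weight * text + (1.0 - text_weight) * overlap


# Made-up mentions, for illustration only.
mention_a = {"attributes": {"name": "Tumor protein p53"}, "classes": {"c1", "c2"}}
mention_b = {"attributes": {"name": "tumor protein P53"}, "classes": {"c1", "c2"}}
mention_c = {"attributes": {"name": "Hemoglobin subunit beta"}, "classes": {"c3"}}

score_dup = duplicate_score(mention_a, mention_b)
score_non = duplicate_score(mention_a, mention_c)
```

In this toy setup, mentions that differ only in letter case and share all classes receive the maximum score, while mentions with different names and disjoint classes score lower. A real system would need a threshold or a calibrated model to decide where duplicates end, which this sketch does not attempt.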
Keywords/Search Tags: Framework, Protein, Mentions, PERF