An entity resolution framework for deduplicating proteins

Posted on:2009-11-07

Degree:M.Sc

Type:Thesis

University:University of Toronto (Canada)

Candidate:Lochovsky, Lucas

GTID:2448390002996019

Subject:Biology

Abstract/Summary:

This thesis describes the design and implementation of PERF, a new framework for deduplicating protein mentions using a wide range of protein attributes. A mention is any recorded information about a protein, whether derived from a database, a high-throughput study, literature text mining, or another source. The framework can readily be extended to deduplicate protein-protein interactions (PPIs). PERF translates mentions into instances of a Framework XML schema to facilitate mention comparisons. It also uses "virtual attribute dependencies" to "enhance" mentions with additional attribute values. PERF computes a likelihood measure based on the textual similarity of mention attributes and on the overlap between the protein classes implied by each mention's non-sequence attributes. A prototype of the framework was implemented, and preliminary tests indicate that it can clearly separate duplicate mentions from non-duplicate mentions.
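
The abstract does not give the form of the likelihood measure, so the following is only an illustrative sketch of how textual attribute similarity and class overlap might be combined into one score. Everything in it is an assumption rather than PERF's actual method: the function names, the use of Python's `difflib` ratio for textual similarity, Jaccard overlap for the protein classes, the equal weighting, and the placeholder class identifiers.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attribute_similarity(m1: dict, m2: dict) -> float:
    """Mean textual similarity over the attributes both mentions share."""
    shared = [key for key in m1 if key in m2]
    if not shared:
        return 0.0
    return sum(text_similarity(m1[k], m2[k]) for k in shared) / len(shared)


def class_overlap(c1: set, c2: set) -> float:
    """Jaccard overlap between the protein classes implied by each mention."""
    if not c1 or not c2:
        return 0.0
    return len(c1 & c2) / len(c1 | c2)


def duplicate_score(m1: dict, c1: set, m2: dict, c2: set,
                    weight: float = 0.5) -> float:
    """Weighted blend of both signals; the weight is a hypothetical choice."""
    return (weight * attribute_similarity(m1, m2)
            + (1.0 - weight) * class_overlap(c1, c2))


# Hypothetical mentions; class identifiers are placeholders, not real IDs.
mention_a = {"name": "tumor protein p53", "organism": "Homo sapiens"}
mention_b = {"name": "Tumor Protein P53", "organism": "homo sapiens"}
mention_c = {"name": "hemoglobin subunit beta", "organism": "Mus musculus"}

score_dup = duplicate_score(mention_a, {"X1", "X2"}, mention_b, {"X1"})
score_non = duplicate_score(mention_a, {"X1", "X2"}, mention_c, {"X9"})
print(score_dup > score_non)
```

In this sketch a pair of mentions whose attributes match textually and whose implied classes overlap scores higher than a pair that differs on both, which is the kind of separation the preliminary tests are reported to show.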

Keywords/Search Tags:

Framework, Protein, Mentions, PERF

Related items

1	Research On Algorithms For Identifying Protein Complexes Based On Protein Network
2	Identification of entity mentions in text and their coreference resolution
3	An evaluation of PERF joins for a two-way semijoin based algorithm
4	ECPF:An Efficient Algorithm For Expanding Clustered Protein Families
5	The Research Of Protein-Protein Extraction In Biomedical Literature
6	Research On Key Techniques Of Protein-protein Interaction Extraction
7	Design And Implementation Of A Protein-protein Interactions Extraction System Based On PubMed Abstracts
8	Research Of Non-homology Computing Method Of Protein Function Prediction
9	Research On Protein-Protein Interactions Based On Primary Structure
10	Research On Algorithm Of Identifying Protein Complexes And Functional Modules On Dynamic Protein-protein Interaction Network