Cross-document entity co-reference resolution in noisy environments

Posted on: 2009-07-16
Degree: Ph.D
Type: Dissertation
University: Brandeis University
Candidate: Baron, Alex
Full Text: PDF
GTID: 1448390002492794
Subject: Language
Abstract/Summary:
The cross-document entity co-reference resolution task is part of, and largely based on, many Natural Language Processing (NLP) innovations that have been developed over the past several decades and are still evolving. At the core of the task is determining whether two or more pieces of information, each mentioned locally in various documents, refer to the same global entity. The complexity of the task increases when the documents represent a variety of sources, including unstructured text, speech, and foreign languages.

This dissertation presents a complete end-to-end solution to the problem of cross-document entity co-reference. The research divides the task into two major components: Name Matching and Entity Disambiguation. The two components have been designed for scalability, extensibility, and incremental processing.

The Name Matching component consists of more than a dozen algorithms and produces equivalences between corpus names. The algorithms use a variety of information sources, which fall into four categories: World Knowledge, Web Knowledge, String Similarity, and Statistical Extraction. The produced alternatives represent various name forms: misspellings, aliases, abbreviations, alternative spellings, short and long versions, nicknames, etc.

The Entity Disambiguation component implements a 3-stage agglomerative clustering algorithm to resolve local entities to global clusters. Each of the algorithm's stages uses combinations of various features to adjust discriminative levels in cluster comparisons. The features are retrieved from document metadata, information extraction, unsupervised topicality, and more.

The system has been evaluated in several ways. The Name Matching component has been tested on a corpus consisting of more than half a million documents from various genres. The performance of the Entity Disambiguation component has been measured against human-annotated collections of English and Arabic documents mentioning ambiguous person and organization names. The truth data were produced using a cross-document annotation tool designed specifically for this research.*

*Copyright 2008 by BBN Technologies Corp. All Rights Reserved. Distribution Statement A (Approved for Public Release; Distribution Unlimited).
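To make the idea of resolving local mentions to global clusters more concrete, here is a minimal sketch of single-stage agglomerative clustering over name strings. It is not the dissertation's method: the abstract describes a 3-stage algorithm with many features, whereas this sketch uses only one string-similarity score from Python's standard `difflib` module. The function names, the average-link merge rule, and the threshold value are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_similarity(c1: list[str], c2: list[str]) -> float:
    """Average-link score: mean pairwise similarity across two clusters."""
    scores = [name_similarity(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)


def agglomerate(mentions: list[str], threshold: float) -> list[list[str]]:
    """Repeatedly merge the most similar pair of clusters.

    Stops when the best available pair scores below `threshold`
    (a hypothetical tuning parameter, not taken from the source).
    """
    # Start with every mention in its own cluster.
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = cluster_similarity(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break
        i, j = best_pair
        # Merge cluster j into cluster i, then drop cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    print(agglomerate(["John Smith", "Jon Smith", "Acme Corp"], threshold=0.8))
```

With these inputs, the two spellings of the person name end up in one cluster and the organization name stays on its own. A system like the one described would presumably combine such a score with contextual features, since string similarity alone cannot separate two different people who share a name.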
Keywords/Search Tags: Entity, Cross-document, Task