Cross-document entity co-reference resolution in noisy environments

Posted on: 2009-07-16

Degree: Ph.D.

Type: Dissertation

University: Brandeis University

Candidate: Baron, Alex

GTID: 1448390002492794

Subject: Language

Abstract/Summary:

The cross-document entity co-reference resolution task is part of, and largely builds on, many Natural Language Processing (NLP) innovations that have been developed over the past several decades and are still evolving. At its core, the task is to determine whether two or more pieces of information, each mentioned locally in different documents, refer to the same global entity. The task becomes more complex when the documents come from a variety of sources, including unstructured text, speech, and foreign languages.

This dissertation presents a complete end-to-end solution to the problem of cross-document entity co-reference. The research divides the task into two major components: Name Matching and Entity Disambiguation. Both components have been designed for scalability, extensibility, and incremental processing.

The Name Matching component consists of more than a dozen algorithms and produces equivalences between corpus names. The algorithms use a variety of information sources, which fall into four categories: World Knowledge, Web Knowledge, String Similarity, and Statistical Extraction. The alternatives they produce cover various name forms: misspellings, aliases, abbreviations, alternative spellings, short and long versions, nicknames, etc.

The Entity Disambiguation component implements a 3-stage agglomerative clustering algorithm to resolve local entities to global clusters. Each stage of the algorithm uses a different combination of features to adjust how discriminative the cluster comparisons are. The features are retrieved from document metadata, information extraction, unsupervised topicality, and more.

The system has been evaluated in several ways. The Name Matching component has been tested on a corpus of more than half a million documents from various genres. The performance of the Entity Disambiguation component has been measured against human-annotated collections of English and Arabic documents mentioning ambiguous person and organization names. The truth data were produced using a cross-document annotation tool designed specifically for this research.*

*Copyright 2008 by BBN Technologies Corp. All Rights Reserved. Distribution Statement A (Approved for Public Release; Distribution Unlimited).
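
To make the two-component idea concrete, the sketch below pairs a string-similarity name matcher with a threshold-stopped agglomerative clustering of mentions. It is only an illustration under stated assumptions, not the dissertation's system: the abstract describes more than a dozen name-matching algorithms and a 3-stage clustering procedure, whereas this is a single-stage toy. The function names, the `name_weight` and `threshold` parameters, the mention dictionaries, and the choice of `difflib` and Jaccard overlap as similarity measures are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] on case-folded names (a stand-in measure)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def jaccard(a: set, b: set) -> float:
    """Overlap of two context-feature sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_similarity(c1, c2, name_weight=0.5):
    """Combine the best pairwise name match with pooled context overlap."""
    best_name = max(
        name_similarity(m1["name"], m2["name"]) for m1 in c1 for m2 in c2
    )
    ctx1 = set().union(*(m["context"] for m in c1))
    ctx2 = set().union(*(m["context"] for m in c2))
    return name_weight * best_name + (1 - name_weight) * jaccard(ctx1, ctx2)


def agglomerate(mentions, threshold=0.6):
    """Greedily merge the most similar pair of clusters until none passes the threshold."""
    clusters = [[m] for m in mentions]  # start with one cluster per local mention
    while len(clusters) > 1:
        (i, j), score = max(
            (
                ((i, j), cluster_similarity(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda pair: pair[1],
        )
        if score < threshold:
            break
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Hypothetical mentions: two documents about a politician, one about a musician.
mentions = [
    {"doc": "d1", "name": "John Smith", "context": {"senate", "election", "ohio"}},
    {"doc": "d2", "name": "Jon Smith", "context": {"senate", "ohio", "campaign"}},
    {"doc": "d3", "name": "John Smith", "context": {"guitar", "album", "tour"}},
]
clusters = agglomerate(mentions)
for cluster in clusters:
    print(sorted(m["doc"] for m in cluster))
```

In this toy data the misspelled "Jon Smith" joins the first "John Smith" because the two mentions share context, while the identically named third mention stays separate because its context does not overlap. That is the ambiguity the Entity Disambiguation component is described as addressing. A multi-stage variant could rerun `agglomerate` with different feature weights and thresholds, but how the dissertation configures its stages is not stated in the abstract.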

Keywords/Search Tags:

Entity, Cross-document, Task

Related items

1	Research On Key Issues Of Document-Level Entity Relation Recognition
2	Research On Named Entity Recognition And Entity Relationship Extraction For Document Corpus
3	Named Entity Recognition In Cross Language And Cross Domain Situations
4	Research And Implementation Of Entity Linking’s Key Problem
5	Graph-based approaches to resolve entity ambiguity
6	Research And Application On Official Document Flowing System Based On Task Group Mode
7	Research On Cross-language Document Sorting Learning Method Based On Bilingual Document Similarity
8	Design and Implementation Of Cross-Domain Digital Document Security Control System
9	English Entity Answer Extraction And Home Find
10	Cross-Lingual Entity Linking And Semantic Query Processing Based On Knowledge Graphs