
Leveraging knowledge of document structure and named entities for information extraction

Posted on: 2006-08-06
Degree: M.S
Type: Thesis
University: Case Western Reserve University
Candidate: Duncan, Frank Bissett
Full Text: PDF
GTID: 2458390008451287
Subject: Computer Science
Abstract/Summary:
We present an end-to-end process for extracting key information from the online and offline documents that define the body of literature of a given domain. The ultimate goal of such a process is to identify the leading authorities in the field, such as authors, publications, and institutions. The process comprises several stages: defining the domain, identifying and acquiring the data sources, and processing those sources to extract the information. Perhaps the most critical stage is extracting the information from the texts and mapping it to an analytical data model. We demonstrate how this can be facilitated by examining the structure of the document, whether expressed explicitly in HTML or implicitly in the conventions of scholarly literature. We also demonstrate the necessity of a database for identifying named entities, and of a set of heuristics for using this database effectively.
Keywords/Search Tags: Information, Structure, Process