
Optimizing information extraction programs over evolving text

Posted on: 2011-02-23
Degree: Ph.D.
Type: Dissertation
University: The University of Wisconsin - Madison
Candidate: Chen, Fei
Full Text: PDF
GTID: 1448390002969677
Subject: Computer Science
Abstract/Summary:
Information extraction (IE) is the problem of extracting structured data from unstructured text. Examples of structured data are entities such as organizations and relationships such as "company X is acquired by company Y." Examples of unstructured text are emails, Web pages, and blogs.

Most current IE approaches have considered only static text corpora, over which IE typically has to be applied only once. Many real-world text corpora, however, are evolving, in that documents can be added, deleted, and modified. An example of evolving text is Wikipedia. To keep extracted information up to date, we must therefore often apply IE repeatedly, to consecutive corpus snapshots. How can such repeated IE be executed efficiently?

In this dissertation I describe solutions that efficiently execute such repeated IE by recycling previous IE efforts. Specifically, given a current corpus snapshot U, these solutions first identify text portions of U that also appear in the previous corpus snapshot V. Since these solutions have already executed the IE program over V, they can recycle the IE results for these shared portions and combine them with the results of executing IE over the remaining parts of U, producing the complete IE results for U. I describe three systems that deal with successively more complex IE programs. The first system, Cyclex, recycles for IE programs that contain a single IE blackbox. The second system, Delex, recycles for IE programs that consist of multiple IE blackboxes. The third system, CRFlex, also considers multi-blackbox IE programs, but some of these blackboxes are based on a leading statistical learning model: Conditional Random Fields. I present experiments on real-world data that validate the proposed solutions.
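The recycling idea can be illustrated with a minimal Python sketch. This is not the Cyclex, Delex, or CRFlex algorithm; it is a simplified illustration under stated assumptions: text is matched at paragraph granularity by exact string equality, and the blackbox's output for a paragraph depends only on that paragraph. The names `extract_years` and `run_with_recycling` are hypothetical and do not come from the dissertation.

```python
import re
from typing import Callable, Dict, List, Tuple

# A mention is (start, end, text); offsets are relative to the text it was found in.
Mention = Tuple[int, int, str]


def extract_years(text: str) -> List[Mention]:
    """Hypothetical IE blackbox: extracts four-digit years (19xx or 20xx)."""
    pattern = r"\b(?:19|20)\d{2}\b"
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


def run_with_recycling(
    snapshot: str,
    cache: Dict[str, List[Mention]],
    extractor: Callable[[str], List[Mention]],
) -> Tuple[List[Mention], Dict[str, List[Mention]], int]:
    """Run `extractor` over `snapshot`, reusing cached per-paragraph results.

    Assumption: a paragraph's mentions depend only on that paragraph, so
    results for an unchanged paragraph can be copied and shifted.
    Returns (mentions with snapshot-level offsets, cache for the next
    snapshot, number of extractor calls actually made).
    """
    mentions: List[Mention] = []
    new_cache: Dict[str, List[Mention]] = {}
    calls = 0
    offset = 0
    for paragraph in snapshot.split("\n\n"):
        if paragraph in cache:
            local = cache[paragraph]        # recycled from the previous snapshot
        elif paragraph in new_cache:
            local = new_cache[paragraph]    # duplicate within this snapshot
        else:
            local = extractor(paragraph)    # new or modified text: run IE
            calls += 1
        new_cache[paragraph] = local
        # Shift paragraph-relative offsets to snapshot-level offsets.
        mentions.extend((s + offset, e + offset, t) for s, e, t in local)
        offset += len(paragraph) + 2        # account for the "\n\n" separator
    return mentions, new_cache, calls


# Previous snapshot V and current snapshot U (toy data).
V = "Founded in 1998.\n\nAcquired in 2006."
U = "Founded in 1998.\n\nAcquired in 2006 and renamed in 2010.\n\nNo dates here."

_, cache_v, calls_v = run_with_recycling(V, {}, extract_years)
mentions_u, _, calls_u = run_with_recycling(U, cache_v, extract_years)
```

In this toy run, the first paragraph of U is unchanged from V, so its results are copied rather than recomputed; only the modified and the newly added paragraphs are passed to the extractor. A real system would need a more careful notion of which text regions can safely be reused, since an extractor's output may depend on surrounding context.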
Keywords/Search Tags:Text, IE programs, Over, Data, Evolving, Solutions