Challenges in managing information extraction

Posted on: 2010-10-07
Degree: Ph.D
Type: Dissertation
University: University of Illinois at Urbana-Champaign
Candidate: Shen, Warren H
Full Text: PDF
GTID: 1448390002481721
Subject: Computer Science
Abstract/Summary:
This dissertation studies information extraction (IE), the problem of extracting structured information from unstructured data. Example IE tasks include extracting person names from news articles, product information from e-commerce Web pages, street addresses from emails, and names of emerging music bands from blogs.

IE is an increasingly important problem in a broad range of applications that seek to use the growing amount of unstructured data available today. Such applications include structured community Web portals, data integration systems, and data mining applications over text data. However, despite significant progress, managing IE and building end-to-end IE applications still involves many difficult challenges. These include writing complex IE programs and optimizing them, deciding how to store and process the large amounts of data that IE applications manage, and executing and obtaining meaningful results from partially specified or approximate IE programs (e.g., during the development process, or in scenarios where an approximate result may already be sufficient).

In this dissertation, we develop solutions to the key challenges mentioned above. First, we develop a declarative framework that makes it easier for developers to write and understand IE programs, and show how to automatically optimize IE programs written in this framework to reduce runtime. Next, given that relational database management systems (RDBMSs) were designed to store and process large data sets, we study the benefits and limitations of employing RDBMSs for storing and processing data in IE applications. Finally, we extend our declarative framework to enable best-effort IE, allowing developers to more easily write and refine approximate IE programs. A key idea underlying these solutions is that many of the principles behind RDBMSs for managing structured data can be extended to IE for managing unstructured data.
Keywords/Search Tags: Data, Managing, Information, IE programs, IE applications, Challenges