
Toward concept-based text understanding and mining

Posted on: 2006-08-22
Degree: Ph.D.
Type: Thesis
University: University of Illinois at Urbana-Champaign
Candidate: Li, Xin
GTID: 2458390008975655
Subject: Artificial Intelligence
Abstract/Summary:
A huge amount of text information in the world is written in natural languages. Understanding and effectively using this text requires the ability to disambiguate text fragments at several levels, syntactically and semantically, abstracting away details and using background knowledge in a variety of ways. One promising direction for understanding text in a truly semantic sense, which is a natural ability of human beings, and for supporting intelligent access to textual information is concept-based text understanding and mining: a mechanism for organizing, indexing, and accessing textual information and for discovering knowledge, centered around real-world concepts and entities. Unfortunately, because of the difficulty caused by language ambiguity, most current text-related techniques still deal directly with syntactic fragments and individual mentions of concepts, without considering a concept as a whole.

A critical problem with these techniques is their inability to resolve concept ambiguity in text. A given entity (representing a person, a location, or an organization) may be mentioned in text in multiple, ambiguous ways. Supporting concept-based natural language understanding requires resolving conceptual ambiguity through Entity Reference Identification, and in particular, identifying entities from their mentions within and across documents and mapping mentions to them, in the hope of discovering and organizing information around the identified entities.

This thesis systematically studies this fundamental problem on the way toward concept-based text understanding and mining.
We develop several machine learning techniques to address different aspects of it, including (1) a discriminative approach to learning similarity metrics that capture the appearance similarity between names; (2) a new supervised discriminative clustering framework that can partition a set of names through global optimization and incorporate learning into clustering, guided by supervision; and (3) a generative probabilistic model, at the heart of which is a view of how documents are generated and how names (of different entity types) are "sprinkled" into them. We show that all of these approaches perform very accurately, in the range of 90% to 95% F1 measure for different entity types, better than baselines and previous approaches to (some aspects of) this problem. Our work also shows that, as more domain-specific knowledge is discovered and incorporated into entity identification, the learning techniques developed accordingly can achieve better performance.

In addition to entity identification, we also extend the generative probabilistic model to address a significant application related to concept-based text understanding and mining: semantic integration between text and databases, based on entity identification and tracking.
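
To make the task concrete, the following is a minimal sketch of grouping name mentions by appearance similarity. It is an illustration only and not the thesis's method: the hand-written token-overlap score stands in for a learned similarity metric, and simple single-link grouping with a fixed threshold stands in for the supervised clustering framework. The function names `name_similarity` and `cluster_mentions`, the threshold value, and the example mentions are all assumptions made for this sketch.

```python
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Hypothetical hand-written score (not a learned metric):
    Jaccard-style overlap of name tokens, where a one-letter
    initial is allowed to match a token that starts with it."""
    ta = a.lower().replace(".", "").split()
    tb = b.lower().replace(".", "").split()
    if not ta or not tb:
        return 0.0
    matched = 0
    used = set()  # indices of tokens in tb that are already matched
    for x in ta:
        for j, y in enumerate(tb):
            if j in used:
                continue
            initial_match = (len(x) == 1 and y.startswith(x)) or (
                len(y) == 1 and x.startswith(y)
            )
            if x == y or initial_match:
                matched += 1
                used.add(j)
                break
    return matched / (len(ta) + len(tb) - matched)


def cluster_mentions(mentions, threshold=0.5):
    """Single-link grouping: two mentions end up in the same group
    if a chain of pairs with similarity >= threshold connects them.
    The threshold is an arbitrary value chosen for illustration."""
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if name_similarity(mentions[i], mentions[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())


mentions = ["John F. Kennedy", "J. Kennedy", "George Bush", "G. Bush"]
clusters = cluster_mentions(mentions)
```

With these four example mentions, the two Kennedy variants fall into one group and the two Bush variants into another. A fixed surface-level score like this cannot tell apart two different people who share a name, which is the kind of ambiguity the learned and generative approaches described above are meant to address.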
Keywords/Search Tags:Text, Entity identification, Information