Automatic acquisition of lexical semantic knowledge from large corpora: The identification of semantically related words, markedness, polarity, and antonymy

Posted on: 1999-07-06
Degree: Ph.D.
Type: Dissertation
University: Columbia University
Candidate: Hatzivassiloglou, Vasileios
Full Text: PDF
GTID: 1468390014472641
Subject: Computer Science
Abstract/Summary:
Lexical semantic knowledge is useful, even indispensable, for many natural language processing applications. Yet, traditional approaches for acquiring this knowledge manually are expensive and cannot easily handle the requisite domain dependence. In this dissertation, I address four closely related problems from lexical semantics, describing a fully automatic system that extracts information about semantic groups and scales from large free-text corpora. The system forms groups of semantically related terms such as {cold, warm, hot}, {final, preliminary}, and {court, jury, law, regulation}. Using gradability indicators, it identifies those of the groups that are actually linguistic scales, i.e., contain terms that can be linearly ordered on the basis of semantic strength. Scalar groups are further partitioned into two subgroups according to evaluative orientation, distinguishing between positively loaded terms (e.g., beautiful, ingenious, unbiased) and their negative counterparts (e.g., ugly, plain, lazy). Finally, the semantic orientation of each subgroup is identified. Combining the above four stages results in an automatic method for the retrieval of possibly domain-dependent pairs of antonyms. All this information is actively learned from the corpus; the system does not access any type of stored information about words such as dictionaries, thesauri, or similar databases.
I have adopted a statistical approach that combines both supervised and unsupervised learning methods and is informed by linguistic models of the data and the tasks at hand. I rely on robust, non-parametric statistical methods; multiple knowledge sources justified by linguistic analyses; and shallow syntactic and morphological processing during information extraction.
I describe and justify the linguistic sources, and present the results (sometimes quite unexpected) of experimental studies that are designed to validate related hypotheses made in the linguistics literature. I also present a novel evaluation method which simultaneously employs multiple reference models without inducing a single "best" model, and results produced for several collections of adjectives and nouns. Finally, I present evidence of strengths of the hybrid linguistic-statistical approach, and discuss applications of the system's output to language problems.
Keywords/Search Tags: Semantic, Related, Automatic, Linguistic