Automatic thesaurus discovery via selective Natural Language Processing: A corpus based approach

Posted on: 1994-01-28
Degree: Ph.D.
Type: Thesis
University: University of Pittsburgh
Candidate: Grefenstette, Gregory Thomas
Full Text: PDF
GTID: 2478390014492438
Subject: Computer Science
Abstract/Summary:
The principal problem with information management today is organizing the ever-widening body of electronically available text. Manual techniques for filtering and structuring such information are useful for sifting through a collection of texts, but manual approaches cannot keep pace with the quantity and variety of text generated. Outside of well-funded fields such as law and medicine, there is little availability of any techniques other than simple word and stem matching for wading through this information. Such string matching techniques are thwarted, however, by the language variability problem, in which a similar idea is expressed by a variety of different words.

We defend the thesis that selective Natural Language Processing, applying subsets of known language processing techniques, over a collection of texts provides enough information to create equivalence classes between different terms, thus easing the problem of language variability. We present a method using partial syntactic analysis that allows creation of equivalence classes over any body of text and we show that the classes created by this method are more like manually-created classes than those created by document co-occurrence, sentence co-occurrence and window-based equivalence class creation techniques. Results of applying this method to information retrieval, thesaurus enrichment, and creation of automatic thesauri are also presented.

The main contributions of this thesis are the following. We describe a robust domain-independent partial parser for English which yields local syntactic contexts of words. We produce a method for using this context to create corpus-dependent similarity lists. We demonstrate that the similarities extracted by this method correspond to human similarity judgments by comparison with psychological data and by showing the overlap with manually created thesauri. We demonstrate that the overlap with manual thesauri using this syntactic context is greater than that obtained by traditional textual windowing techniques. We develop evaluation methods applicable to any corpus-based meaning extraction techniques: artificial synonyms, and gold standards measurements. We show applications of our similarity discovery techniques to information retrieval, thesaurus enrichment, and automatic thesaurus construction.
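
The idea of turning local syntactic contexts into similarity lists can be illustrated with a minimal sketch. It assumes a partial parser has already emitted (word, relation, context word) triples. The triples, the relation labels, and the function names below are hypothetical, and unweighted Jaccard overlap is used only as a stand-in, since the abstract does not name the similarity measure.

```python
from collections import defaultdict

# Hypothetical parser output: (word, relation, context word).
# Invented for illustration; not data from the thesis.
TRIPLES = [
    ("treatment", "ADJ", "surgical"),
    ("treatment", "DOBJ-of", "receive"),
    ("treatment", "NN", "cancer"),
    ("therapy", "ADJ", "surgical"),
    ("therapy", "DOBJ-of", "receive"),
    ("therapy", "NN", "radiation"),
    ("hospital", "ADJ", "general"),
    ("hospital", "DOBJ-of", "enter"),
]


def build_contexts(triples):
    """Map each word to the set of (relation, context word) attributes seen with it."""
    contexts = defaultdict(set)
    for word, relation, other in triples:
        contexts[word].add((relation, other))
    return dict(contexts)


def jaccard(a, b):
    """Unweighted Jaccard overlap of two attribute sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_list(word, contexts):
    """Rank every other word by attribute overlap with `word`, most similar first."""
    target = contexts.get(word, set())
    scores = [
        (other, jaccard(target, attrs))
        for other, attrs in contexts.items()
        if other != word
    ]
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for neighbor, score in similarity_list("treatment", build_contexts(TRIPLES)):
        print(f"{neighbor}\t{score:.2f}")
```

In this toy data, "treatment" and "therapy" share two of four distinct attributes, so "therapy" ranks above "hospital", which shares none. A window-based variant would replace the (relation, context word) attributes with the bare words found within a fixed distance of each occurrence.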
Keywords/Search Tags:Techniques, Information, Thesaurus, Language processing, Automatic