Automatic thesaurus discovery via selective Natural Language Processing: A corpus based approach

Posted on: 1994-01-28

Degree: Ph.D.

Type: Thesis

University: University of Pittsburgh

Candidate: Grefenstette, Gregory Thomas

Full Text: PDF

GTID: 2478390014492438

Subject: Computer Science

Abstract/Summary:

The principal problem with information management today is organizing the ever-widening body of electronically available text. Manual techniques for filtering and structuring such information are useful for sifting through a collection of texts, but they cannot keep pace with the quantity and variety of text generated. Outside of well-funded fields such as law and medicine, few techniques other than simple word and stem matching are available for wading through this information. Such string-matching techniques are thwarted, however, by the language variability problem, in which a similar idea is expressed by a variety of different words.

We defend the thesis that selective Natural Language Processing (applying subsets of known language processing techniques) over a collection of texts provides enough information to create equivalence classes between different terms, thus easing the problem of language variability. We present a method using partial syntactic analysis that allows creation of equivalence classes over any body of text, and we show that the classes created by this method are more like manually created classes than those created by document co-occurrence, sentence co-occurrence, and window-based equivalence class creation techniques. Results of applying this method to information retrieval, thesaurus enrichment, and creation of automatic thesauri are also presented.

The main contributions of this thesis are the following. We describe a robust, domain-independent partial parser for English which yields local syntactic contexts of words. We produce a method for using this context to create corpus-dependent similarity lists. We demonstrate that the similarities extracted by this method correspond to human similarity judgments, both by comparison with psychological data and by showing the overlap with manually created thesauri. We demonstrate that the overlap with manual thesauri using this syntactic context is greater than that obtained by traditional textual windowing techniques. We develop evaluation methods applicable to any corpus-based meaning extraction technique: artificial synonyms and gold standard measurements. We show applications of our similarity discovery techniques to information retrieval, thesaurus enrichment, and automatic thesaurus construction.
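
The contrast the abstract draws between syntactic contexts and textual windows can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not name a similarity measure, so plain Jaccard overlap is assumed here, and the function names, relation labels, and the hand-written triples standing in for partial-parser output are all hypothetical.

```python
from collections import defaultdict


def window_contexts(tokens, size=2):
    """Map each word to the set of words within `size` positions of it."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - size)
        hi = min(len(tokens), i + size + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[word].add(tokens[j])
    return dict(contexts)


def relation_contexts(triples):
    """Map each word to its set of (relation, other_word) attributes."""
    contexts = defaultdict(set)
    for word, relation, other in triples:
        contexts[word].add((relation, other))
    return dict(contexts)


def jaccard(a, b):
    """Shared attributes divided by all attributes (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_list(target, contexts):
    """Rank every other word by context overlap with `target`."""
    target_attrs = contexts.get(target, set())
    scores = [
        (other, jaccard(target_attrs, attrs))
        for other, attrs in contexts.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical output of a partial parser: (word, relation, other word).
triples = [
    ("doctor", "subj-of", "treat"),
    ("doctor", "subj-of", "examine"),
    ("doctor", "adj", "young"),
    ("physician", "subj-of", "treat"),
    ("physician", "subj-of", "examine"),
    ("patient", "obj-of", "treat"),
    ("patient", "obj-of", "examine"),
]
syntactic = relation_contexts(triples)
syntactic_ranking = similarity_list("doctor", syntactic)

# A plain window over one sentence cannot tell subject from object.
window = window_contexts("the doctor treats the patient".split(), size=2)
window_score = jaccard(window["doctor"], window["patient"])
```

In this toy data, the syntactic attributes rank "physician" above "patient" for "doctor", while the two-word window gives "doctor" and "patient" identical contexts. The example only illustrates why the kind of context matters; it is not a result from the thesis.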

Keywords/Search Tags:

Techniques, Information, Thesaurus, Language processing, Automatic

Related items

1	Research On Automatic Construction Of Natural Language Thesaurus
2	Research In Thesaurus-based Ontology Building Method
3	Design Of Automatic Term Extraction System And Study Of Key Techniques
4	Information security applications of natural language processing techniques
5	An automatic feedback thesaurus approach and its parallel implementations
6	Automatic Supervised Thesauri Construction with 'Roget's Thesaurus'
7	The Automatic Formation Of Conversion From Chinese Thesaurus To Ontology
8	Discussion On The Classification Principles And Framework Of Traditional Chinese Medicine Language System's Meta-thesaurus
9	Resolving quasi-synonym relationships in automatic thesaurus construction using fuzzy rough sets and an inverse term frequency similarity function
10	Experiments with automatic indexing and a relational thesaurus in a Chinese information retrieval system