Toward knowledge-free induction of machine-readable dictionaries

Posted on: 2002-09-04
Degree: Ph.D.
Type: Dissertation
University: University of Colorado at Boulder
Candidate: Schone, Patrick John
Full Text: PDF
GTID: 1468390011491373
Subject: Computer Science
Abstract/Summary:
Machine-readable dictionaries (MRDs) have found uses in many natural language tasks. Current MRDs are typically generated either by the expensive process of digitizing hard-copy dictionaries or by hand construction. It would be valuable if MRDs could be built automatically from a corpus of text. Moreover, if induction were knowledge-free, it could be applied across ages, across domains, and perhaps even to non-language problems.

In this research, we focus on three major subtasks of language-independent, near-knowledge-free induction of MRDs. In particular, we concentrate on inducing dictionary headwords, identifying language morphologies, and clustering and labeling parts of speech. Unlike past research efforts, our algorithms make use of no language-specific information, and most are in fact completely knowledge-free.

To induce dictionary headwords, we identify nine currently available algorithms for multiword unit selection and, based on extensive comparisons to both static and dynamic gold standards, we isolate the algorithms that are best for our task. We then explore semantic non-compositionality and non-substitutivity as means of improving performance. We find that non-substitutivity provides modest improvements. Since not all languages are amenable to multiword unit processing, we also identify segmentation strategies and show how to enhance those strategies.

Next we introduce a new knowledge-free methodology for automatic acquisition of morphology. This approach makes use of character trees, singular value decomposition, induced syntactic constraints, orthographic information, and transitivity. We measure its performance in German, Dutch, and English and show that it outperforms other existing knowledge-free algorithms.

Using this morphological information and distributional information, we introduce a novel approach to clustering words based on syntactic usage. We proceed to automatically affix actual part-of-speech tags to those clusters without using any training data or lexicons. We couple language universals (our sole human input) with features extracted from the clusters and tag the clusters using a probabilistic framework.

A number of algorithms already exist for finding semantic relationships. We therefore conclude by describing such algorithms and by discussing how the components we have induced could be combined with existing strategies in order to yield actual definitions.
Keywords/Search Tags: Knowledge-free, Induction, MRDs