
ParaMor: From paradigm structure to natural language morphology induction

Posted on: 2009-08-29
Degree: Ph.D
Type: Thesis
University: Carnegie Mellon University
Candidate: Monson, Christian
Full Text: PDF
GTID: 2448390005950942
Subject: Artificial Intelligence
Abstract/Summary:
Most of the world's natural languages have complex morphology, but the expense of building morphological analyzers by hand has prevented the development of morphological analysis systems for the large majority of languages. Unsupervised induction techniques, which learn from unannotated text data, can facilitate the development of computational morphology systems for new languages. Such unsupervised morphological analysis systems have been shown to help natural language processing tasks including speech recognition (Creutz, 2006) and information retrieval (Kurimo and Turunen, 2008). This thesis describes ParaMor, an unsupervised induction algorithm for learning morphological paradigms from large collections of words in any natural language. Paradigms are sets of mutually substitutable morphological operations that organize the inflectional morphology of natural languages. ParaMor focuses on the most common morphological process, suffixation.

ParaMor learns paradigms in a three-step algorithm. First, a recall-centric search scours a space of candidate partial paradigms for those that may model suffixes of true paradigms. Second, ParaMor merges selected candidates that appear to model portions of the same paradigm. Third, ParaMor discards those clusters that most likely do not model true paradigms. Based on the acquired paradigms, ParaMor then segments words into morphemes. ParaMor, by design, is particularly effective for inflectional morphology, while other systems, such as Morfessor (Creutz, 2006), better identify derivational morphology. This thesis leverages the complementary strengths of ParaMor and Morfessor by adjoining the analyses from the two systems.

ParaMor and its combination with Morfessor participated in Morpho Challenge, a peer-operated competition for morphological analysis systems (Kurimo, Turunen, and Varjokallio, 2008). The Morpho Challenge competitions held in 2007 and 2008 evaluated each system's morphological analyses in five languages: English, German, Finnish, Turkish, and Arabic. When ParaMor's morphological analyses are merged with those of Morfessor, the resulting morpheme recall in all five languages is higher than that of any system that competed in either year's Challenge; in Turkish, for example, ParaMor's recall of 52.1% is twice that of the next highest system. This strong recall leads to F1 scores for morpheme identification above those of all other systems in every language but English.
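
To make the notion of a paradigm as a set of mutually substitutable suffixes concrete, the following toy sketch groups stems by the exact set of suffixes they take in a word list. It is an illustration only and not ParaMor's actual search, merging, or filtering procedure; the function name `candidate_paradigms`, its thresholds, and the example words are assumptions introduced here.

```python
from collections import defaultdict


def candidate_paradigms(words, min_stems=2, min_suffixes=2):
    """Toy illustration (hypothetical, not ParaMor itself).

    Returns a dict mapping each suffix set (a frozenset of strings) to the
    set of stems that occur with exactly those suffixes in `words`.
    """
    vocab = set(words)

    # Collect, for every possible stem, all suffixes it appears with.
    # The empty suffix is allowed; the empty stem is not.
    stem_to_suffixes = defaultdict(set)
    for word in vocab:
        for i in range(1, len(word) + 1):
            stem_to_suffixes[word[:i]].add(word[i:])

    # Group stems that share exactly the same suffix set.
    suffixes_to_stems = defaultdict(set)
    for stem, suffixes in stem_to_suffixes.items():
        if len(suffixes) >= min_suffixes:
            suffixes_to_stems[frozenset(suffixes)].add(stem)

    # Keep only suffix sets supported by enough distinct stems.
    return {
        suffixes: stems
        for suffixes, stems in suffixes_to_stems.items()
        if len(stems) >= min_stems
    }


if __name__ == "__main__":
    toy_words = ["hablo", "hablas", "habla", "canto", "cantas", "canta"]
    for suffixes, stems in candidate_paradigms(toy_words).items():
        print(sorted(suffixes), sorted(stems))
```

On this six-word list the sketch yields two candidates: the suffix set {a, as, o} with stems habl and cant, and the suffix set {empty suffix, s} with stems habla and canta. Both describe the same words with different stem boundaries, which hints at why a candidate search of this kind needs later merging and filtering steps.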
Keywords/Search Tags: Natural language, Morphology, ParaMor, Morphological, Systems