Learning probabilistic lexicalized grammars for natural language processing

Posted on: 2002-06-09
Degree: Ph.D.
Type: Thesis
University: Harvard University
Candidate: Hwa, Rebecca
Full Text: PDF
GTID: 2468390011996701
Subject: Computer Science
Abstract/Summary:
A good representation of language is essential to building natural language processing (NLP) applications. In recent years, the growing availability of machine-readable text corpora has popularized the use of corpus-trained probabilistic grammars to represent languages in NLP systems. Although automatically inducing grammars from large corpora is an appealing idea, it faces several challenges. First, the trained grammar must capture the complexity and ambiguities inherent in human languages. Second, to be useful in practical applications, the grammar must be computationally tractable. Third, although raw text is abundant, inducing high-quality grammars depends on manually annotated training data, which are scarce; the learning algorithm must therefore generalize well. Finally, meeting all three challenges involves inherent trade-offs; a meta-level challenge is to find a good compromise among the competing factors.

To address these challenges, this thesis presents a partially supervised induction algorithm, based on the Expectation-Maximization (EM) principle, for the Probabilistic Lexicalized Tree Insertion Grammar (PLTIG) formalism. By exploiting the lexical properties of the PLTIG formalism in the learning algorithm, we show that it is possible to automatically induce a grammar for a natural language that adequately resolves ambiguities and manages domain complexity at a tractable computational cost. By augmenting the basic learning algorithm with training techniques such as grammar adaptation and sample selection, we show that the induction's dependence on annotated training data can be significantly reduced: our empirical studies indicate that even with a 36% reduction in annotated training data, the learning algorithm induces grammars without degrading their quality.
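The sample-selection technique invites a brief illustration. The sketch below shows one common form of uncertainty-based sample selection for parsing, in which each candidate sentence is scored by the entropy of the parser's k-best parse distribution and the most uncertain sentences are chosen for annotation. The kbest_fn interface, the entropy criterion, and all function names are illustrative assumptions, not the thesis's actual implementation.

# Minimal sketch of uncertainty-based sample selection for grammar
# induction. The kbest_fn interface and the entropy criterion are
# assumptions for illustration, not the thesis's actual code.
import heapq
import math

def parse_entropy(kbest_parses):
    """Entropy of a k-best parse distribution, given as (tree, log_prob)
    pairs. Higher entropy means the current grammar is less certain about
    the sentence, so annotating it should be more informative."""
    weights = [math.exp(lp) for _, lp in kbest_parses]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(sentences, kbest_fn, budget):
    """Pick the `budget` sentences whose parse distributions have the
    highest entropy under the current grammar; these are the candidates
    to send for manual annotation before the next round of training."""
    scored = [(parse_entropy(kbest_fn(s)), s) for s in sentences]
    return [s for _, s in heapq.nlargest(budget, scored, key=lambda t: t[0])]

In a setting like the one the abstract describes, the selected sentences would be manually annotated and folded back into EM re-estimation of the grammar parameters, reducing the total amount of annotation the induction requires.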
Keywords/Search Tags: Natural language, Grammar, Annotated training data, Learning algorithm, Probabilistic