Learning probabilistic lexicalized grammars for natural language processing

Posted on: 2002-06-09
Degree: Ph.D.
Type: Thesis
University: Harvard University
Candidate: Hwa, Rebecca
Full Text: PDF
GTID: 2468390011996701
Subject: Computer Science
Abstract/Summary:
A good representation of language is essential to building natural language processing (NLP) applications. In recent years, the growing availability of machine-readable text corpora has popularized the use of corpus-trained probabilistic grammars to represent languages in NLP systems. Although automatically inducing grammars from large corpora is an appealing idea, it faces several challenges. First, the trained grammar must capture the complexity and ambiguities inherent in human languages. Second, to be useful in practical applications, the grammar must be computationally tractable. Third, although raw text is abundant, inducing high-quality grammars depends on manually annotated training data, which are scarce; the learning algorithm must therefore generalize well. Finally, meeting all three challenges involves inherent trade-offs; a meta-level challenge is to find a good compromise among the competing factors.

To address these challenges, this thesis presents a partially supervised induction algorithm, based on the Expectation-Maximization (EM) principle, for the Probabilistic Lexicalized Tree Insertion Grammar (PLTIG) formalism. By exploiting the lexical properties of the PLTIG formalism in the learning algorithm, we show that it is possible to automatically induce a grammar for a natural language that adequately resolves ambiguities and manages domain complexity at a tractable computational cost. By augmenting the basic learning algorithm with training techniques such as grammar adaptation and sample selection, we show that the induction's dependence on annotated training data can be significantly reduced: our empirical studies indicate that even with a 36% reduction in annotated training data, the learning algorithm induces grammars without degrading their quality.
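The sample-selection technique invites a brief illustration. The sketch below shows one common form of uncertainty-based sample selection for parsing, in which each candidate sentence is scored by the entropy of the parser's k-best parse distribution and the most uncertain sentences are chosen for annotation. The kbest_fn interface, the entropy criterion, and all function names are illustrative assumptions, not the thesis's actual implementation.

# Minimal sketch of uncertainty-based sample selection for grammar
# induction. The kbest_fn interface and the entropy criterion are
# assumptions for illustration, not the thesis's actual code.
import heapq
import math

def parse_entropy(kbest_parses):
    """Entropy of a k-best parse distribution, given as (tree, log_prob)
    pairs. Higher entropy means the current grammar is less certain about
    the sentence, so annotating it should be more informative."""
    weights = [math.exp(lp) for _, lp in kbest_parses]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(sentences, kbest_fn, budget):
    """Pick the `budget` sentences whose parse distributions have the
    highest entropy under the current grammar; these are the candidates
    to send for manual annotation before the next round of training."""
    scored = [(parse_entropy(kbest_fn(s)), s) for s in sentences]
    return [s for _, s in heapq.nlargest(budget, scored, key=lambda t: t[0])]

In a setting like the one the abstract describes, the selected sentences would be manually annotated and folded back into EM re-estimation of the grammar parameters, reducing the total amount of annotation the induction requires.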
Keywords/Search Tags: Natural language, Grammar, Annotated training data, Learning algorithm, Probabilistic