
Knowledge integration into language models: A random forest approach

Posted on: 2010-04-17
Degree: Ph.D
Type: Dissertation
University: The Johns Hopkins University
Candidate: Su, Yi
Full Text: PDF
GTID: 1448390002977034
Subject: Statistics
Abstract/Summary:
A language model (LM) is a probability distribution over all possible word sequences. It is a vital component of many natural language processing tasks, such as automatic speech recognition, statistical machine translation, and information retrieval. The art of language modeling has long been dominated by a simple yet powerful model family: the n-gram language models. Many attempts have been made to go beyond n-grams, either by proposing a new mathematical framework or by integrating more knowledge of human language, preferably both. The random forest language model (RFLM), a collection of randomized decision tree language models, has distinguished itself as a successful effort of the former kind; we explore its potential for the latter.

We begin our quest by advancing our understanding of the RFLM through exploratory experimentation. To facilitate further investigation, we address the problem of training the RFLM on large amounts of data with an efficient disk-swapping algorithm. We then formalize our method of integrating various knowledge sources into language models with random forests and illustrate its applicability with three innovative applications: morphological LMs for Arabic, prosodic LMs for speech recognition, and the combination of syntactic and topic information in LMs.
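The two core ideas above, assigning probabilities to word sequences and averaging over a collection of randomized models, can be sketched in a few lines. This is a loose toy illustration under stated assumptions, not the dissertation's actual algorithm: the real RFLM grows randomized decision trees over word histories, whereas the sketch below merely averages add-k-smoothed bigram models trained on bootstrap resamples of a hypothetical toy corpus. All names and data here are invented for illustration.

```python
import math
import random
from collections import defaultdict

# Hypothetical toy corpus; not data from the dissertation.
CORPUS = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat ate the fish".split(),
]
VOCAB_SIZE = len({w for sent in CORPUS for w in sent})

def train_bigram(sentences):
    """Collect bigram and unigram counts for probability estimates."""
    bigrams, unigrams = defaultdict(int), defaultdict(int)
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            bigrams[(w1, w2)] += 1
            unigrams[w1] += 1
    return bigrams, unigrams

def bigram_prob(bigrams, unigrams, w1, w2, k=1.0):
    """Add-k smoothed estimate of P(w2 | w1)."""
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * VOCAB_SIZE)

def forest_prob(models, w1, w2):
    """Average the estimates of several randomized models, mirroring
    how the RFLM averages its randomized decision tree LMs."""
    return sum(bigram_prob(b, u, w1, w2) for b, u in models) / len(models)

# Build a small "forest": each member is trained on a bootstrap
# resample of the sentences, a crude stand-in for tree randomization.
random.seed(0)
forest = [
    train_bigram(random.choices(CORPUS, k=len(CORPUS)))
    for _ in range(5)
]
```

Averaging many high-variance randomized estimators is what lets a random forest outperform any single member; the dissertation applies this idea to decision tree language models rather than the bagged bigram models sketched here.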
Keywords/Search Tags: Language, Random