
Knowledge integration into language models: A random forest approach

Posted on: 2010-04-17
Degree: Ph.D
Type: Dissertation
University: The Johns Hopkins University
Candidate: Su, Yi
Full Text: PDF
GTID: 1448390002977034
Subject: Statistics
Abstract/Summary:
A language model (LM) is a probability distribution over all possible word sequences. It is a vital component of many natural language processing tasks, such as automatic speech recognition, statistical machine translation, and information retrieval. The art of language modeling has long been dominated by a simple yet powerful model family: the n-gram language models. Many attempts have been made to go beyond n-grams, either by proposing a new mathematical framework or by integrating more knowledge of human language, preferably both. The random forest language model (RFLM), a collection of randomized decision tree language models, has distinguished itself as a successful effort of the former kind; we explore its potential for the latter.

We begin our quest by advancing our understanding of the RFLM through exploratory experimentation. To facilitate further investigation, we address the problem of training the RFLM on large amounts of data with an efficient disk-swapping algorithm. We then formalize our method of integrating various knowledge sources into language models with random forests and illustrate its applicability with three innovative applications: morphological LMs for Arabic, prosodic LMs for speech recognition, and the combination of syntactic and topic information in LMs.
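The two core ideas above, assigning probabilities to word sequences and averaging over a collection of randomized models, can be sketched in a few lines. This is a loose toy illustration under stated assumptions, not the dissertation's actual algorithm: the real RFLM grows randomized decision trees over word histories, whereas the sketch below merely averages add-k-smoothed bigram models trained on bootstrap resamples of a hypothetical toy corpus. All names and data here are invented for illustration.

```python
import math
import random
from collections import defaultdict

# Hypothetical toy corpus; not data from the dissertation.
CORPUS = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat ate the fish".split(),
]
VOCAB_SIZE = len({w for sent in CORPUS for w in sent})

def train_bigram(sentences):
    """Collect bigram and unigram counts for probability estimates."""
    bigrams, unigrams = defaultdict(int), defaultdict(int)
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            bigrams[(w1, w2)] += 1
            unigrams[w1] += 1
    return bigrams, unigrams

def bigram_prob(bigrams, unigrams, w1, w2, k=1.0):
    """Add-k smoothed estimate of P(w2 | w1)."""
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * VOCAB_SIZE)

def forest_prob(models, w1, w2):
    """Average the estimates of several randomized models, mirroring
    how the RFLM averages its randomized decision tree LMs."""
    return sum(bigram_prob(b, u, w1, w2) for b, u in models) / len(models)

# Build a small "forest": each member is trained on a bootstrap
# resample of the sentences, a crude stand-in for tree randomization.
random.seed(0)
forest = [
    train_bigram(random.choices(CORPUS, k=len(CORPUS)))
    for _ in range(5)
]
```

Averaging many high-variance randomized estimators is what lets a random forest outperform any single member; the dissertation applies this idea to decision tree language models rather than the bagged bigram models sketched here.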
Keywords/Search Tags: Language, Random