Random forests and the data sparseness problem in language modeling

Posted on: 2006-01-02
Degree: Ph.D
Type: Dissertation
University: The Johns Hopkins University
Candidate: Xu, Peng
Full Text: PDF
GTID: 1458390008472318
Subject: Engineering
Abstract/Summary:
Language modeling is the problem of predicting words based on histories containing words already seen. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation. The data sparseness problem associated with language modeling arises from these two aspects. Although each aspect has been studied separately, few solutions address both at the same time.

We explore the use of Random Forests (RFs) in language modeling to deal with the two key aspects jointly. The goal of this work is to develop a new language model smoothing technique based on randomly grown Decision Trees (DTs) and to apply the resulting RF language models to automatic speech recognition. This new technique is complementary to many of the existing techniques that deal with the data sparseness problem.

After presenting our approach to efficient DT construction, we study our RF approach in the context of n-gram language modeling, in which a history consists of the n-1 preceding words. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories have more than four words. We show that our RF language models are superior to the best known smoothing technique, interpolated Kneser-Ney smoothing, in reducing both perplexity (PPL) and word error rate (WER) in large vocabulary speech recognition systems. In particular, we show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach to only one of its many language models.

The new technique developed in this work is general. We show that it works well when combined with other techniques, including word clustering and the structured language model (SLM).
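As a rough illustration of two quantities the abstract refers to, the sketch below computes perplexity from per-word probabilities and combines several per-tree estimates into one forest estimate. The function names `perplexity` and `forest_probability` are hypothetical, and the equal-weight average over trees is an assumption made here for illustration; the abstract does not state how the dissertation combines its decision trees.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test sequence, given the probability the model
    assigned to each word: exp of the negative mean log-probability."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


def forest_probability(tree_probs):
    """Combine per-tree estimates of P(word | history) into one value.

    Assumption: a plain equal-weight average over the trees. This is an
    illustrative choice, not taken from the abstract.
    """
    if not tree_probs:
        raise ValueError("need at least one tree estimate")
    return sum(tree_probs) / len(tree_probs)


# Hypothetical numbers: three trees score each of two test words.
per_word = [
    forest_probability([0.10, 0.30, 0.20]),
    forest_probability([0.05, 0.05, 0.05]),
]
print(perplexity(per_word))
```

Lower perplexity means the model assigned higher probability to the test words, which is the sense in which the abstract compares RF language models with interpolated Kneser-Ney smoothing.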
Keywords/Search Tags:Language, Data sparseness problem, Words, Technique, Aspects