Random forests and the data sparseness problem in language modeling

Posted on: 2006-01-02

Degree: Ph.D.

Type: Dissertation

University: The Johns Hopkins University

Candidate: Xu, Peng

GTID: 1458390008472318

Subject: Engineering

Abstract/Summary:

Language modeling is the problem of predicting words based on histories containing words already seen. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation. The data sparseness problem associated with language modeling arises from these two aspects. Although each aspect has been studied separately, few solutions address both at the same time.

We explore the use of Random Forests (RFs) in language modeling to deal with the two key aspects jointly. The goal of this work is to develop a new language model smoothing technique based on randomly grown Decision Trees (DTs) and to apply the resulting RF language models to automatic speech recognition. This new technique is complementary to many of the existing techniques that address the data sparseness problem.

After presenting our approach to efficient DT construction, we study our RF approach in the context of n-gram type language modeling, in which a history consists of the preceding n-1 words. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories have more than four words. We show that our RF language models are superior to interpolated Kneser-Ney smoothing, the best known smoothing technique, in reducing both the perplexity (PPL) and the word error rate (WER) in large vocabulary speech recognition systems. In particular, we show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach to only one of its many language models.

The new technique developed in this work is general. We show that it works well when combined with other techniques, including word clustering and the structured language model (SLM).
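
The two ideas the abstract combines, grouping histories into equivalence classes and averaging over many randomized groupings, can be illustrated with a toy sketch. The code below is not the dissertation's method: it assumes a bigram setting, uses random partitions of the vocabulary as a stand-in for randomly grown decision trees, and uses add-alpha smoothing rather than Kneser-Ney. All function names (`random_partition`, `train_class_model`, `average_models`, `perplexity`) are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def random_partition(vocab, num_classes, seed):
    """Assign each word to a random history class.

    Assumption: this is a crude stand-in for a randomly grown decision tree.
    """
    rng = random.Random(seed)
    assignment = {w: rng.randrange(num_classes) for w in sorted(vocab)}
    # Histories outside the vocabulary fall back to class 0.
    return lambda prev: assignment.get(prev, 0)


def train_class_model(corpus, classify, vocab, alpha=1.0):
    """Estimate add-alpha smoothed P(word | class of previous word)."""
    counts = defaultdict(Counter)
    for prev, word in zip(corpus, corpus[1:]):
        counts[classify(prev)][word] += 1

    def prob(word, prev):
        c = counts[classify(prev)]
        return (c[word] + alpha) / (sum(c.values()) + alpha * len(vocab))

    return prob


def average_models(models):
    """Forest-style model: the uniform average of the individual models."""
    def prob(word, prev):
        return sum(m(word, prev) for m in models) / len(models)

    return prob


def perplexity(prob, text):
    """PPL = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over N bigrams."""
    log_sum = sum(math.log(prob(w, p)) for p, w in zip(text, text[1:]))
    return math.exp(-log_sum / (len(text) - 1))


if __name__ == "__main__":
    corpus = "the cat sat on the mat the dog sat on the rug".split()
    vocab = set(corpus)
    models = [
        train_class_model(corpus, random_partition(vocab, 3, seed), vocab)
        for seed in range(5)
    ]
    forest = average_models(models)
    print("single model PPL:", perplexity(models[0], corpus))
    print("averaged model PPL:", perplexity(forest, corpus))
```

Because each component model is a proper distribution over the vocabulary, their uniform average is one as well, so perplexities of a single model and of the averaged model can be compared directly on the same text.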

Keywords/Search Tags:

Language, Data sparseness problem, Words, Technique, Aspects

Related items

1	Research On The DNA Words And Its Arithmetic Actualize
2	Research On Optimization Algorithm For Dataset Covering Problem
3	Natural Language Processing, Words Related To Knowledge No Guide For Build And Balanced Classifier
4	Study Of Application Of A Language Model Combining Statistics And Rules In Chinese Input Method
5	Application And Research Of Statistical Language Model
6	WordNet Based Multi Aspects Sentimental Summarization Of Institution's Reviews
7	Research And Application Of Data Sparseness Problem In Collaborative Filtering Recommenderation
8	Nounal Polysemous Words Discriminance In NLU And Application In Intelligent Instruments
9	Towards Data-Mining: Data Cleaning Based On Clustering Techniques
10	Communication beyond words: Multimedia approaches to bridging language disabilities and barriers