
Improving and predicting performance of statistical language models in sparse domains

Posted on: 1999-03-05    Degree: Ph.D    Type: Thesis
University: Boston University    Candidate: Iyer, Rukmini M    Full Text: PDF
GTID: 2468390014967796    Subject: Engineering
Abstract/Summary:
Standard statistical language models, or n-gram models, which represent the probability of word sequences, suffer from sparse-data problems in tasks where large amounts of domain-specific text are not available. This thesis focuses on improving the estimation of domain-dependent n-gram models by using out-of-domain text data. Previous approaches for estimating language models from multi-domain data have not accounted for the characteristic variations of style and content across domains. In contrast, this thesis introduces two approaches that compensate for multi-domain differences, both representing "style" by part-of-speech (POS) sequences and "content" by the particular choice of words. First, data from multiple domains is combined using similarity weighting schemes that discriminate for content and style relevance prior to pooling multi-domain text. Second, n-gram distributions from multiple domains are combined via a POS-dependent n-gram framework that separately compensates for word and POS usage differences. Two variations are explored: explicitly transforming the out-of-domain distribution before combining it with an in-domain model, and separately estimating components of the POS-dependent n-gram model using multi-domain data. Finally, measures to analyze and predict recognition performance of language models are also investigated, resulting in an algorithm for predicting performance differences associated with localized changes in language models given a recognition system.

Experiments are mainly based on the Switchboard corpus of spontaneous conversations, with out-of-domain text drawn from the Wall Street Journal and Broadcast News corpora. Portability of the techniques developed in this thesis is also evaluated through additional experiments on a Spanish task. Both the data and distribution combination approaches lead to a 3-5% improvement in recognition performance over a domain-specific model, demonstrating larger gains than those obtained with previous approaches and the largest gain from language modeling advances reported thus far on the Switchboard task. Furthermore, the new performance predictor demonstrates a 0.96 correlation with recognition performance, compared to 0.83 for the existing perplexity measure, while providing a diagnostic of weaknesses of the language model under consideration. Results from this thesis impact the rapid development of new applications of speech and language technology, ranging from speech and handwriting recognition to language transcription, understanding, and translation.
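For readers unfamiliar with the baseline idea of combining in-domain and out-of-domain n-gram estimates, the sketch below illustrates fixed-weight linear interpolation of two additively smoothed bigram models and evaluation by perplexity. It is only a minimal illustration, not the thesis's similarity-weighting scheme or POS-dependent framework; the smoothing constant, the toy sentences, and the interpolation weight are illustrative assumptions.

```python
import math
from collections import Counter

def train_bigram(sentences, alpha=0.1):
    """Additively smoothed bigram model: P(w | prev) with pseudo-count alpha."""
    bigrams, history_counts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        for prev, w in zip(tokens, tokens[1:]):
            bigrams[(prev, w)] += 1
            history_counts[prev] += 1
    V = len(vocab)
    def prob(prev, w):
        return (bigrams[(prev, w)] + alpha) / (history_counts[prev] + alpha * V)
    return prob

def interpolate(p_in, p_out, lam=0.7):
    """Fixed-weight linear interpolation of two conditional distributions."""
    return lambda prev, w: lam * p_in(prev, w) + (1.0 - lam) * p_out(prev, w)

def perplexity(prob, sentences):
    """Per-token perplexity of a bigram model on held-out sentences."""
    log_sum, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(prev, w))
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)

# Toy usage: in-domain = conversational text, out-of-domain = newswire-style text.
in_domain = [["uh", "i", "think", "so"], ["yeah", "i", "agree"]]
out_domain = [["the", "market", "closed", "higher"], ["officials", "said", "so"]]
held_out = [["i", "think", "so"]]

p_mix = interpolate(train_bigram(in_domain), train_bigram(out_domain), lam=0.7)
print("held-out perplexity:", round(perplexity(p_mix, held_out), 2))
```

In practice the interpolation weight would be tuned on held-out in-domain data; the thesis's contribution is to go beyond such uniform weighting by accounting for style and content differences between domains.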
Keywords/Search Tags:Language, Performance, N-gram, Recognition, Data, Domains