Hierarchical statistical language models for unlimited vocabularies

Posted on: 2004-06-11
Degree: Ph.D.
Type: Thesis
University: The University of Rochester
Candidate: Galescu, Lucian
Full Text: PDF
GTID: 2468390011476635
Subject: Computer Science
Abstract/Summary:
Statistical language models have made a crucial contribution towards the development of large vocabulary speech recognition systems with acceptable performance, yet leave much to be desired as to their linguistic adequacy. Severe limitations on the recognition vocabulary and on the relations between words make such models of language both linguistically unacceptable and technologically insufficient. It is now well understood that further progress is only possible by increasing the linguistic knowledge encoded in a language model, while at the same time retaining much of the efficiency associated with the simplicity of purely statistical, collocation-based models.

In this thesis we focus on two related drawbacks of current statistical language models: that they do not account for the hierarchical structure inherent in language, and that they are not general enough to account for and integrate novel lexical items. To solve these problems, we propose extending the n-gram model paradigm to allow for multi-level language modeling, where levels correspond to different granularities in segmenting the input. Thus, whereas typical language models limit themselves to the lexical level, we think it would be advantageous for a language model to include phrase-level and sub-lexical level information as well. This would provide means of incorporating linguistic structural knowledge into statistical language models that are still easy to build and efficient to use.

We give particular emphasis to the sub-lexical level, which so far has been much less explored than the supra-lexical, phrase level. We propose a technique to find informative sub-lexical units and a model of word formation based on these units. Integrating this model with a lexical-level language model gives us a powerful mechanism to tackle the difficult problem of recognizing new words on the fly.

Throughout the thesis we present detailed evaluations of the new techniques proposed. Based on the results thus obtained, we claim to have laid the groundwork for new directions of research that will benefit not only the large vocabulary speech recognition research community, but also the areas of speech synthesis and dialogue systems.
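The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of combining a lexical-level n-gram model with a sub-lexical model for new words. Everything in it is an assumption made for illustration: the class names `CharBigram` and `TwoLevelBigram` are hypothetical, the sub-lexical units are plain characters rather than the informative units the thesis proposes, and add-one smoothing stands in for whatever estimation the thesis actually uses. An out-of-vocabulary word is scored as the probability of an unknown-word token at the lexical level times the probability of its spelling at the sub-lexical level.

```python
import math
from collections import Counter

UNK, BOS, EOS = "<unk>", "<s>", "</s>"


class CharBigram:
    """Sub-lexical level (assumed): add-one smoothed character bigrams."""

    def __init__(self, words):
        self.pairs = Counter()
        self.context = Counter()
        self.alphabet = {EOS}
        for w in words:
            chars = [BOS] + list(w) + [EOS]
            self.alphabet.update(chars[1:])
            for a, b in zip(chars, chars[1:]):
                self.pairs[(a, b)] += 1
                self.context[a] += 1

    def logprob(self, word):
        chars = [BOS] + list(word) + [EOS]
        v = len(self.alphabet) + 1  # one extra slot shared by unseen characters
        return sum(
            math.log((self.pairs[(a, b)] + 1) / (self.context[a] + v))
            for a, b in zip(chars, chars[1:])
        )


class TwoLevelBigram:
    """Lexical level (assumed): word bigrams with an UNK token that is
    expanded by the sub-lexical model whenever an unknown word occurs."""

    def __init__(self, sentences, vocab):
        self.vocab = set(vocab)
        self.pairs = Counter()
        self.context = Counter()
        for s in sentences:
            toks = [BOS] + [w if w in self.vocab else UNK for w in s] + [EOS]
            for a, b in zip(toks, toks[1:]):
                self.pairs[(a, b)] += 1
                self.context[a] += 1
        self.size = len(self.vocab) + 2  # vocabulary words plus UNK and EOS
        # Assumption: spellings of known words train the sub-lexical model.
        self.sub = CharBigram(self.vocab)

    def logprob(self, sentence):
        total, prev = 0.0, BOS
        for w in list(sentence) + [EOS]:
            tok = w if (w in self.vocab or w == EOS) else UNK
            total += math.log(
                (self.pairs[(prev, tok)] + 1) / (self.context[prev] + self.size)
            )
            if tok == UNK:
                # New word: add the log-probability of its spelling.
                total += self.sub.logprob(w)
            prev = tok
        return total


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "zebra", "ran"]]
model = TwoLevelBigram(corpus, vocab={"the", "cat", "dog", "sat", "ran"})
print(model.logprob(["the", "cat", "sat"]))   # all words in vocabulary
print(model.logprob(["the", "zebu", "sat"]))  # "zebu" is scored by its spelling
```

In this sketch a sentence containing a new word still receives a finite score, because the unknown-word token carries probability mass at the lexical level and the spelling model distributes that mass over possible new words.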
Keywords/Search Tags: Language, Vocabulary, Speech, Recognition