
Non-negative matrix factorization approach to language model adaptation

Posted on: 2003-06-20
Degree: Ph.D
Type: Thesis
University: Rutgers The State University of New Jersey - New Brunswick
Candidate: Novak, Miroslav
Full Text: PDF
GTID: 2468390011479983
Subject: Engineering
Abstract/Summary:
Statistical language modeling attempts to model the extremely complicated process of natural language generation. In statistical language modeling, large amounts of text are used to automatically determine the model's parameters. Language models are used in automatic speech recognition, machine translation, natural language processing, and related applications. N-gram based language models are widely used because they are easy to train and perform relatively well, as long as the domain of their use is well represented in the training corpus. However, n-gram models exploit only the immediate past (typically the last two words), ignoring longer-range dependencies between words. They also require a large amount of training data if they are to be adapted to a specific domain.

This thesis explores the use of Latent Semantic Analysis (LSA) in language modeling. LSA is a technique based on a low-rank approximation of a word-document co-occurrence matrix; it captures long-distance dependencies in natural language and is used to enhance the n-gram model. It can be viewed either as a long-distance dependency model or as an adaptation technique, depending on the size of the exploited past.

Non-negative Matrix Factorization (NMF) is proposed as an alternative to the Singular Value Decomposition (SVD) traditionally used in previous work. NMF allows the task of long-distance dependency language modeling to be formulated directly in the space of probability distributions. This is more natural than SVD, which requires a heuristic method to convert the result of a linear transformation into the probability space. It is shown that the NMF-based language model (LM) is more compact and robust than an SVD-based LM. Furthermore, unlike SVD-based LMs, which typically work with the word-document matrix, the proposed NMF-based LM can be built from either the word-document matrix or the word co-occurrence matrix; the construction of the word co-occurrence matrix does not require explicit segmentation of the training text into documents. The new language model, a combination of a tri-gram and an NMF model, reduced test-set perplexity by up to 25% relative to the tri-gram model alone.
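The abstract does not spell out the factorization updates or the combination scheme, so the following is only a minimal numpy sketch of the general idea: a KL-divergence NMF of a word-document count matrix, whose non-negative factors normalize directly into probability distributions (the property the abstract contrasts with SVD), followed by a simple linear interpolation with an n-gram probability. The toy corpus, the rank k=2, the iteration count, and the interpolation weight lam are illustrative assumptions, not the thesis's actual configuration.

    import numpy as np

    # Toy "documents" -- purely illustrative; the thesis trains on large corpora.
    docs = [
        "the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "stocks rose as markets rallied".split(),
        "markets fell and stocks dropped".split(),
    ]
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}

    # Word-document count matrix C, shape (V, D).
    C = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            C[w2i[w], j] += 1.0

    def nmf_kl(C, k, iters=200, eps=1e-10, seed=0):
        """Rank-k NMF minimizing KL divergence, via the standard
        Lee & Seung multiplicative updates. Returns W (V x k), H (k x D)."""
        rng = np.random.default_rng(seed)
        V, D = C.shape
        W = rng.random((V, k)) + eps
        H = rng.random((k, D)) + eps
        for _ in range(iters):
            R = C / (W @ H + eps)                       # ratio C / (WH)
            W *= (R @ H.T) / (H.sum(axis=1) + eps)
            R = C / (W @ H + eps)
            H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        return W, H

    W, H = nmf_kl(C, k=2)

    # Non-negative factors normalize directly into distributions --
    # no heuristic mapping into probability space, unlike SVD.
    P_w_given_topic = W / W.sum(axis=0, keepdims=True)    # P(word | topic)
    P_topic_given_doc = H / H.sum(axis=0, keepdims=True)  # P(topic | doc)

    # "Semantic" word probability: P_nmf(w | doc) = sum_k P(w|k) P(k|doc).
    P_nmf = P_w_given_topic @ P_topic_given_doc           # shape (V, D)

    # Linear interpolation with an n-gram probability; lam is an assumed
    # weight, not the thesis's combination scheme.
    lam = 0.7
    def combined_prob(p_ngram, word, doc_j):
        return lam * p_ngram + (1 - lam) * P_nmf[w2i[word], doc_j]

Because every entry of W and H is non-negative, each column can be renormalized into a valid probability distribution; with SVD the factors contain negative values, which is why some heuristic conversion step is needed, as the abstract notes.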
Keywords/Search Tags: Model, Language, Matrix, NMF