
Non-negative matrix factorization approach to language model adaptation

Posted on: 2003-06-20
Degree: Ph.D
Type: Thesis
University: Rutgers The State University of New Jersey - New Brunswick
Candidate: Novak, Miroslav
Full Text: PDF
GTID: 2468390011479983
Subject: Engineering
Abstract/Summary:
Statistical language modeling attempts to model the extremely complicated process of natural language generation. In statistical language modeling, large amounts of text are used to automatically determine the model's parameters. Language models are used in automatic speech recognition, machine translation, natural language processing, and related applications. N-gram based language models are widely used because they are easy to train and perform relatively well, as long as the domain of their use is well represented in the training corpus. However, n-gram models exploit only the immediate past (typically the last two words), ignoring longer-range dependencies between words. They also require a large amount of training data if they are to be adapted to a specific domain.

This thesis explores the use of Latent Semantic Analysis (LSA) in language modeling. LSA is a technique based on a low-rank approximation of a word-document co-occurrence matrix; it captures long-distance dependencies in natural language and is used to enhance the n-gram model. It can be viewed either as a long-distance dependency model or as an adaptation technique, depending on the size of the exploited past.

Non-negative Matrix Factorization (NMF) is proposed as an alternative to the Singular Value Decomposition (SVD) traditionally used in previous work. NMF allows the task of long-distance dependency language modeling to be formulated directly in the space of probability distributions. This is more natural than SVD, which requires a heuristic method to convert the result of a linear transformation into the probability space. It is shown that the NMF-based language model (LM) is more compact and robust than an SVD-based LM. Furthermore, unlike SVD-based LMs, which typically work with the word-document matrix, the proposed NMF-based LM can be built from either the word-document matrix or the word co-occurrence matrix; the construction of the word co-occurrence matrix does not require explicit segmentation of the training text into documents. The new language model, a combination of a tri-gram and an NMF model, reduced test-set perplexity by up to 25% relative to the tri-gram model alone.
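The abstract does not spell out the factorization updates or the combination scheme, so the following is only a minimal numpy sketch of the general idea: a KL-divergence NMF of a word-document count matrix, whose non-negative factors normalize directly into probability distributions (the property the abstract contrasts with SVD), followed by a simple linear interpolation with an n-gram probability. The toy corpus, the rank k=2, the iteration count, and the interpolation weight lam are illustrative assumptions, not the thesis's actual configuration.

    import numpy as np

    # Toy "documents" -- purely illustrative; the thesis trains on large corpora.
    docs = [
        "the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "stocks rose as markets rallied".split(),
        "markets fell and stocks dropped".split(),
    ]
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}

    # Word-document count matrix C, shape (V, D).
    C = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            C[w2i[w], j] += 1.0

    def nmf_kl(C, k, iters=200, eps=1e-10, seed=0):
        """Rank-k NMF minimizing KL divergence, via the standard
        Lee & Seung multiplicative updates. Returns W (V x k), H (k x D)."""
        rng = np.random.default_rng(seed)
        V, D = C.shape
        W = rng.random((V, k)) + eps
        H = rng.random((k, D)) + eps
        for _ in range(iters):
            R = C / (W @ H + eps)                       # ratio C / (WH)
            W *= (R @ H.T) / (H.sum(axis=1) + eps)
            R = C / (W @ H + eps)
            H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        return W, H

    W, H = nmf_kl(C, k=2)

    # Non-negative factors normalize directly into distributions --
    # no heuristic mapping into probability space, unlike SVD.
    P_w_given_topic = W / W.sum(axis=0, keepdims=True)    # P(word | topic)
    P_topic_given_doc = H / H.sum(axis=0, keepdims=True)  # P(topic | doc)

    # "Semantic" word probability: P_nmf(w | doc) = sum_k P(w|k) P(k|doc).
    P_nmf = P_w_given_topic @ P_topic_given_doc           # shape (V, D)

    # Linear interpolation with an n-gram probability; lam is an assumed
    # weight, not the thesis's combination scheme.
    lam = 0.7
    def combined_prob(p_ngram, word, doc_j):
        return lam * p_ngram + (1 - lam) * P_nmf[w2i[word], doc_j]

Because every entry of W and H is non-negative, each column can be renormalized into a valid probability distribution; with SVD the factors contain negative values, which is why some heuristic conversion step is needed, as the abstract notes.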
Keywords/Search Tags: Model, Language, Matrix, NMF