Language-independent text learning with statistical n-gram language models

Posted on: 2004-03-18    Degree: Ph.D    Type: Thesis
University: University of Waterloo (Canada)    Candidate: Peng, Fuchun    Full Text: PDF
GTID: 2468390011472532    Subject: Computer Science
Abstract/Summary:
In this thesis, we attempt to build language-independent text learning systems that do not require significant human intervention. Our solution is based on statistical n-gram language modeling and unsupervised machine learning. Statistical language modeling is concerned with estimating the probability of word sequences, which provides a natural and principled approach to text learning. Statistical n-gram language models treat text as a sequence of characters or words and offer the advantage of language independence. Unsupervised machine learning offers the advantage of significantly reducing human labor. We focus on improving performance on three text learning problems by building statistical n-gram language models and by exploiting the value of unlabeled data. These tasks are language- and task-independent text classification, language-independent lexical learning and unsupervised word segmentation, and Chinese text retrieval.

The first task we consider is language- and task-independent text classification. We generalize the naive Bayes classifier by incorporating local dependencies between observations (characters or words) with a Markov chain. The result is a Chain Augmented Naive (CAN) Bayes classifier. By exploiting sophisticated smoothing techniques from language modeling research, we can deal with the feature explosion problem efficiently in CAN models. Our language-modeling-based approach avoids many of the ad hoc and error-accumulating aspects of feature selection. The approach works at both the character and word levels.

The second task we consider is language-independent lexical learning and unsupervised word segmentation. We propose an unsupervised lexicon construction approach by applying an iterative optimization procedure (the EM algorithm) to n-gram models. We then segment new sequences by dynamic programming (the Viterbi algorithm). Our approach removes the need for manual dictionary construction and allows for adaptive segmentations based on context. To reduce the sparse data and local maxima problems that occur with the traditional EM algorithm, we use a modified EM algorithm that exploits hierarchical structure and employs a self-supervised estimation strategy.

Finally, we apply unsupervised word segmentation to Chinese text retrieval. Despite its low segmentation accuracy relative to supervised methods, our unsupervised segmentation method has many advantages over traditional word segmentation approaches. Experiments on the TREC-5 and TREC-6 data sets show that unsupervised word segmentation often outperforms traditional segmentation methods in Chinese text retrieval. We also find that the relationship between word segmentation performance and retrieval performance is not monotonic, contrary to what is commonly assumed. (Abstract shortened by UMI.)
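
To make the chain-augmented classification idea concrete, the following is a minimal sketch of a character-level Markov-chain classifier with simple add-one smoothing. It is not the thesis's implementation: the CAN models rely on more sophisticated smoothing techniques from language modeling research, and every identifier and data value below is illustrative only.

    # Sketch: per-class character bigram (Markov chain) models with add-one
    # smoothing; a document is assigned to the class whose chain gives it the
    # highest log-probability. All names here are hypothetical.
    from collections import defaultdict
    import math

    class NGramClassifier:
        def __init__(self):
            self.bigram = {}   # class -> {(prev, cur): count}
            self.unigram = {}  # class -> {prev: count}
            self.vocab = set()
            self.priors = {}

        def train(self, labeled_docs):
            class_counts = defaultdict(int)
            for label, text in labeled_docs:
                class_counts[label] += 1
                bi = self.bigram.setdefault(label, defaultdict(int))
                uni = self.unigram.setdefault(label, defaultdict(int))
                padded = " " + text  # simple start-of-sequence padding
                for prev, cur in zip(padded, padded[1:]):
                    bi[(prev, cur)] += 1
                    uni[prev] += 1
                    self.vocab.update((prev, cur))
            total = sum(class_counts.values())
            self.priors = {c: n / total for c, n in class_counts.items()}

        def log_prob(self, label, text):
            bi, uni, v = self.bigram[label], self.unigram[label], len(self.vocab)
            lp = math.log(self.priors[label])
            padded = " " + text
            for prev, cur in zip(padded, padded[1:]):
                # add-one (Laplace) smoothed bigram probability
                lp += math.log((bi[(prev, cur)] + 1) / (uni[prev] + v))
            return lp

        def classify(self, text):
            return max(self.bigram, key=lambda c: self.log_prob(c, text))

Because the model scores raw character sequences, the same code applies to any language without tokenization or feature selection, which is the language-independence argument made above.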
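The segmentation step of the second task can likewise be illustrated with a small Viterbi dynamic-programming sketch over a unigram word model. It assumes word probabilities have already been estimated (for example by the EM procedure described above); the unknown-word penalty and the toy lexicon are purely illustrative and not taken from the thesis.

    # Sketch: dictionary-free word segmentation by dynamic programming (Viterbi)
    # under a unigram word model. word_probs maps candidate words to estimated
    # probabilities; values below are hypothetical.
    import math

    def viterbi_segment(text, word_probs, max_word_len=4, unk_logprob=-20.0):
        n = len(text)
        best = [0.0] + [float("-inf")] * n   # best log-probability ending at i
        back = [0] * (n + 1)                 # backpointer: start of last word
        for i in range(1, n + 1):
            for j in range(max(0, i - max_word_len), i):
                word = text[j:i]
                lp = (math.log(word_probs[word]) if word in word_probs
                      else unk_logprob * len(word))  # crude per-character penalty
                if best[j] + lp > best[i]:
                    best[i], back[i] = best[j] + lp, j
        # recover the segmentation from the backpointers
        words, i = [], n
        while i > 0:
            words.append(text[back[i]:i])
            i = back[i]
        return list(reversed(words))

    # toy example with made-up probabilities
    lexicon = {"text": 0.3, "retrieval": 0.2, "re": 0.05, "trie": 0.01, "val": 0.01}
    print(viterbi_segment("textretrieval", lexicon, max_word_len=9))
    # -> ['text', 'retrieval']
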
Keywords/Search Tags: Language, Text, Word segmentation, CAN, Statistical, EM algorithm, Models, N-gram