Language-independent text learning with statistical n-gram language models

Posted on: 2004-03-18    Degree: Ph.D    Type: Thesis
University: University of Waterloo (Canada)    Candidate: Peng, Fuchun    Full Text: PDF
GTID: 2468390011472532    Subject: Computer Science
Abstract/Summary:
In this thesis, we attempt to build language-independent text learning systems that do not require significant human intervention. Our solution is based on statistical n-gram language modeling and unsupervised machine learning. Statistical language modeling is concerned with estimating the probability of word sequences, which provides a natural and principled approach to text learning. Statistical n-gram language models treat text as a sequence of characters or words and offer the advantage of language independence. Unsupervised machine learning offers the advantage of significantly reducing human labor. We focus on improving performance on three text learning problems by building statistical n-gram language models and by exploiting the value of unlabeled data. These tasks are language- and task-independent text classification, language-independent lexical learning and unsupervised word segmentation, and Chinese text retrieval.

The first task we consider is language- and task-independent text classification. We generalize the naive Bayes classifier by incorporating local dependencies between observations (characters or words) with a Markov chain. The result is a Chain Augmented Naive (CAN) Bayes classifier. By exploiting sophisticated smoothing techniques from language modeling research, we can deal with the feature explosion problem efficiently in CAN models. Our language-modeling-based approach avoids many of the ad hoc and error-accumulating aspects of feature selection. The approach works at both the character and word levels.

The second task we consider is language-independent lexical learning and unsupervised word segmentation. We propose an unsupervised lexicon construction approach by applying an iterative optimization procedure (the EM algorithm) to n-gram models. We then segment new sequences by dynamic programming (the Viterbi algorithm). Our approach removes the need for manual dictionary construction and allows for adaptive segmentations based on context. To reduce the sparse data and local maxima problems that occur with the traditional EM algorithm, we use a modified EM algorithm that exploits hierarchical structure and employs a self-supervised estimation strategy.

Finally, we apply unsupervised word segmentation to Chinese text retrieval. Despite its low segmentation accuracy relative to supervised methods, our unsupervised segmentation method has many advantages over traditional word segmentation approaches. Experiments on the TREC-5 and TREC-6 data sets show that unsupervised word segmentation often outperforms traditional segmentation methods in Chinese text retrieval. We also find that the relationship between word segmentation performance and retrieval performance is not monotonic, contrary to what is commonly assumed. (Abstract shortened by UMI.)
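
To make the chain-augmented classification idea concrete, the following is a minimal sketch of a character-level Markov-chain classifier with simple add-one smoothing. It is not the thesis's implementation: the CAN models rely on more sophisticated smoothing techniques from language modeling research, and every identifier and data value below is illustrative only.

    # Sketch: per-class character bigram (Markov chain) models with add-one
    # smoothing; a document is assigned to the class whose chain gives it the
    # highest log-probability. All names here are hypothetical.
    from collections import defaultdict
    import math

    class NGramClassifier:
        def __init__(self):
            self.bigram = {}   # class -> {(prev, cur): count}
            self.unigram = {}  # class -> {prev: count}
            self.vocab = set()
            self.priors = {}

        def train(self, labeled_docs):
            class_counts = defaultdict(int)
            for label, text in labeled_docs:
                class_counts[label] += 1
                bi = self.bigram.setdefault(label, defaultdict(int))
                uni = self.unigram.setdefault(label, defaultdict(int))
                padded = " " + text  # simple start-of-sequence padding
                for prev, cur in zip(padded, padded[1:]):
                    bi[(prev, cur)] += 1
                    uni[prev] += 1
                    self.vocab.update((prev, cur))
            total = sum(class_counts.values())
            self.priors = {c: n / total for c, n in class_counts.items()}

        def log_prob(self, label, text):
            bi, uni, v = self.bigram[label], self.unigram[label], len(self.vocab)
            lp = math.log(self.priors[label])
            padded = " " + text
            for prev, cur in zip(padded, padded[1:]):
                # add-one (Laplace) smoothed bigram probability
                lp += math.log((bi[(prev, cur)] + 1) / (uni[prev] + v))
            return lp

        def classify(self, text):
            return max(self.bigram, key=lambda c: self.log_prob(c, text))

Because the model scores raw character sequences, the same code applies to any language without tokenization or feature selection, which is the language-independence argument made above.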
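The segmentation step of the second task can likewise be illustrated with a small Viterbi dynamic-programming sketch over a unigram word model. It assumes word probabilities have already been estimated (for example by the EM procedure described above); the unknown-word penalty and the toy lexicon are purely illustrative and not taken from the thesis.

    # Sketch: dictionary-free word segmentation by dynamic programming (Viterbi)
    # under a unigram word model. word_probs maps candidate words to estimated
    # probabilities; values below are hypothetical.
    import math

    def viterbi_segment(text, word_probs, max_word_len=4, unk_logprob=-20.0):
        n = len(text)
        best = [0.0] + [float("-inf")] * n   # best log-probability ending at i
        back = [0] * (n + 1)                 # backpointer: start of last word
        for i in range(1, n + 1):
            for j in range(max(0, i - max_word_len), i):
                word = text[j:i]
                lp = (math.log(word_probs[word]) if word in word_probs
                      else unk_logprob * len(word))  # crude per-character penalty
                if best[j] + lp > best[i]:
                    best[i], back[i] = best[j] + lp, j
        # recover the segmentation from the backpointers
        words, i = [], n
        while i > 0:
            words.append(text[back[i]:i])
            i = back[i]
        return list(reversed(words))

    # toy example with made-up probabilities
    lexicon = {"text": 0.3, "retrieval": 0.2, "re": 0.05, "trie": 0.01, "val": 0.01}
    print(viterbi_segment("textretrieval", lexicon, max_word_len=9))
    # -> ['text', 'retrieval']
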
Keywords/Search Tags: Language, Text, Word segmentation, CAN, Statistical, EM algorithm, Models, N-gram