
Similarity-based approaches to natural language processing

Posted on: 1998-07-13
Degree: Ph.D.
Type: Thesis
University: Harvard University
Candidate: Lee, Lillian Jane
GTID: 2468390014474213
Subject: Computer Science
Abstract/Summary:
Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential for considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity.

Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing, represents (to our knowledge) the first application of soft clustering to problems in natural language processing. We use this method to cluster words drawn from 44 million words of Associated Press Newswire and 10 million words from Grolier's encyclopedia, and find that language models built from the clusters have substantial predictive power. Our algorithm also extends with no modification to other domains, such as document clustering.

Our second approach is a nearest-neighbor approach: instead of calculating a centroid for each class, we in essence build a cluster around each word. We compare several such nearest-neighbor approaches on a word sense disambiguation task and find that, as a whole, their performance is far superior to that of standard methods. In another set of experiments, we show that estimation techniques based on the nearest-neighbor model achieve perplexity reductions of more than 20 percent over standard techniques in the prediction of low-frequency events, as well as statistically significant reductions in speech recognition error rate.
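
The similarity measure named in the abstract is the Kullback-Leibler divergence between probability distributions, for example the distributions of words that co-occur with two candidate words. Below is a minimal Python sketch of the divergence itself; the toy distributions are illustrative and not from the thesis.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats.

    p and q are probability distributions over a shared support, e.g.
    empirical distributions of the objects of two verbs. Terms with
    p(x) = 0 contribute zero; q should be smoothed so that it is
    nonzero wherever p is (otherwise the divergence is infinite).
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

# Toy example: distributions over {water, beer, bread} as verb objects.
drink = np.array([0.7, 0.3, 0.0])
sip   = np.array([0.6, 0.3, 0.1])
print(kl_divergence(drink, sip))  # small value: the two verbs look similar
```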
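The soft clustering described in the second paragraph relies on deterministic annealing: at a low inverse temperature beta every point belongs almost equally to every cluster, and raising beta on a schedule gradually sharpens the memberships. The sketch below is a toy under simplifying assumptions of our own (Euclidean distance rather than KL divergence, a fixed number of clusters rather than iterative centroid splitting); it illustrates the annealing idea, not the thesis's exact algorithm.

```python
import numpy as np

def soft_cluster(points, k, betas, seed=0):
    """Soft clustering in the spirit of deterministic annealing.

    Every point belongs to every cluster with probability proportional
    to exp(-beta * squared distance to the centroid); beta (an inverse
    temperature) is raised on a schedule, sharpening the memberships.
    Sketch only: Euclidean distance and fixed k stand in for the
    thesis's KL divergence and hierarchical centroid splitting.
    """
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for beta in betas:
        for _ in range(25):  # fixed-point iterations at this temperature
            d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            d2 -= d2.min(axis=1, keepdims=True)   # shift for numerical stability
            w = np.exp(-beta * d2)                # unnormalized memberships
            w /= w.sum(axis=1, keepdims=True)     # p(cluster | point)
            centroids = (w.T @ points) / w.sum(axis=0)[:, None]
    return w, centroids

# Toy usage: two well-separated blobs, annealed over three temperatures.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
memberships, centroids = soft_cluster(pts, k=2, betas=[0.1, 1.0, 10.0])
```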
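The nearest-neighbor estimation in the final paragraph backs off from an unseen word pair to the words most similar to the conditioning word. A sketch of the general idea follows; the data structures, weighting scheme, and all names here are illustrative stand-ins, whereas the thesis derives its weights from the divergence measure above.

```python
def similarity_estimate(w1, w2, bigram_counts, neighbours, k=5):
    """Estimate P(w2 | w1) from the k words most similar to w1.

    bigram_counts[w] maps each word following w to its count;
    neighbours[w1] is a list of (word, similarity) pairs sorted by
    decreasing similarity. Both structures are hypothetical stand-ins
    for the thesis's divergence-based neighbour sets.
    """
    nearest = neighbours[w1][:k]
    total = sum(sim for _, sim in nearest)
    estimate = 0.0
    for w, sim in nearest:
        counts = bigram_counts[w]
        estimate += (sim / total) * counts.get(w2, 0) / sum(counts.values())
    return estimate

# Toy usage: "sip wine" is unseen, but similar words make it plausible.
bigram_counts = {"drink": {"water": 6, "wine": 4}, "swallow": {"water": 9, "pill": 1}}
neighbours = {"sip": [("drink", 0.8), ("swallow", 0.2)]}
print(similarity_estimate("sip", "wine", bigram_counts, neighbours))  # 0.32
```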
Keywords/Search Tags: Approaches, Events, Language, Information, Words