Font Size: a A A

Using graphs and random walks for discovering latent semantic relationships in text

Posted on:2008-06-21Degree:Ph.DType:Dissertation
University:University of MichiganCandidate:Erkan, GunesFull Text:PDF
GTID:1448390005452132Subject:Computer Science
Abstract/Summary:
We propose a graph-based representation of text collections where the nodes are textual units such as sentences or documents, and the edges represent the pairwise similarity function between these units. We show how random walks on such a graph can give us better approximations for the latent similarities between two natural language strings. We also derive algorithms based on random walk models to rank the nodes in a text similarity graph to address the text summarization problem in information retrieval. The similarity functions used in the graphs are intentionally chosen to be very simple and language-independent to make our methods as generic as possible, and to show that significant improvements can be achieved even by starting with such similarity functions. We put special emphasis on language modeling-based similarity functions since we use them for the first time on problems such as document clustering and classification, and get improved results compared to the classical similarity functions such as cosine. Our graph-based methods are applicable to a diverse set of problems including generic and focused summarization, document clustering, and text classification. The text summarization system we have developed has ranked as one of the top systems in Document Understanding Conferences over the past few years. In document clustering and classification, using language modeling functions performs consistently better than using the classical cosine measure reaching as high as 25% improvement in accuracy. Random walks on the similarity graph achieve additional significant improvements on top of this. We also revisit the nearest neighbor text classification methods and derive semi-supervised versions by using random walks that rival the state-of-the-art classification algorithms such as Suppor Vector Machines.
Keywords/Search Tags:Random walks, Text, Using, Graph, Classification, Similarity functions, Document
Related items