Evaluating the performance of latent semantic indexing

Posted on: 2006-12-04
Degree: Ph.D.
Type: Dissertation
University: University of Colorado at Boulder
Candidate: Suwannajan, Pakinee
Full Text: PDF
GTID: 1458390005993080
Subject: Computer Science
Abstract/Summary:
Information Retrieval (IR) has emerged in various fields such as the Web, bibliographic systems, and digital libraries. Data indexing and retrieval are parts of IR and have been of interest to computer and information scientists in recent years. One of the most popular IR models is the vector space model, which was developed to solve many problems associated with exact lexical matching. It employs linear algebra tools to find the similarity between a document and a query. Latent Semantic Indexing (LSI), a widely used variant of the vector space model, was designed to overcome problems arising from synonymy and polysemy. It is often claimed in the literature that LSI outperforms the vector space model. We discovered that LSI performs better than the vector space model only in some cases, specifically when the amount of information that a query shares with the relevant documents is greater than the amount it shares with the non-relevant documents. We also studied the capability of LSI in solving synonymy and polysemy problems. Synonyms are words that have the same meaning, whereas a polyseme is a single word that has multiple meanings. We discovered that LSI can distinguish between two synonymous words only when they both appear in the same or similar contexts. For polysemy, LSI outperforms the vector space model only when two contexts that use different meanings of a polyseme share at least some information.
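The contrast between the two models can be illustrated with a minimal sketch. It is not taken from the dissertation: the toy term-document matrix, the function names, the choice of k = 2, and the convention of projecting both documents and queries with the transposed left singular vectors are all assumptions made for illustration. In the toy data, "car" and "automobile" never occur in the same document but both co-occur with "engine", which is the "similar contexts" situation the abstract describes.

```python
import numpy as np

# Hypothetical toy term-document matrix (rows = terms, columns = documents).
# "car" and "automobile" never co-occur, but both appear with "engine".
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 1],  # petal
], dtype=float)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return float(u @ v / (nu * nv))


def vsm_scores(A, q):
    """Plain vector space model: compare the query with each document column."""
    return [cosine(A[:, j], q) for j in range(A.shape[1])]


def lsi_scores(A, q, k):
    """LSI sketch: project documents and query into a rank-k latent space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T  # one row per document
    q_k = Uk.T @ q                           # query in the same space
    return [cosine(d, q_k) for d in docs_k]


# Query containing only the term "car".
q = np.array([1, 0, 0, 0, 0], dtype=float)

vsm = vsm_scores(A, q)
lsi = lsi_scores(A, q, k=2)

for j, (v, l) in enumerate(zip(vsm, lsi)):
    print(f"doc {j}: VSM = {v:.3f}, LSI = {l:.3f}")
```

In this sketch, the document that contains "automobile" but not "car" has no term in common with the query, so the plain vector space model scores it zero, while the rank-2 LSI projection places it close to the query because of the shared context term. The result depends on the toy data and on k; it does not reproduce the dissertation's experiments.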
Keywords/Search Tags: Vector space model, Information, LSI