Phrase-based vector space model in document retrieval

Posted on: 2004-01-22
Degree: Ph.D
Type: Dissertation
University: University of California, Los Angeles
Candidate: Mao, Wenlei
Full Text: PDF
GTID: 1468390011970677
Subject: Computer Science
Abstract/Summary:
With the advent of the Internet and the World Wide Web, information distribution has become more convenient than ever. However, this unprecedented abundance of information makes locating a specific piece of information ever more difficult. Since most current search targets are text documents, this research studies the effective retrieval of text documents.

Many document retrieval systems are based on the vector space model, which represents a document as a vector of index terms. Concepts have been proposed to replace word stems as the index terms in order to improve retrieval effectiveness. However, past research revealed that such systems did not outperform traditional stem-based systems. Incorporating conceptual similarity derived from knowledge sources should have the potential to improve retrieval effectiveness, yet the incompleteness of the knowledge sources precludes significant improvement. To remedy this problem, we propose to represent documents using phrases. A phrase consists of a concept and several word stems. The similarity between two phrases is jointly determined by their conceptual similarity and their common word stems. The document similarity can in turn be derived from the phrase similarities.

We demonstrate that the phrase-based vector space model is more effective in document retrieval than the traditional stem-based vector space model. Significant effectiveness improvements are observed in both exhaustive search and cluster-based retrieval. We also show that this significant increase in retrieval effectiveness can be achieved without sacrificing too much efficiency.
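
The abstract does not give the formulas used in the dissertation, so the following Python sketch is only an illustration of the general idea: a phrase holds a concept and a set of word stems, phrase similarity combines conceptual similarity with stem overlap, and document similarity is built from pairwise phrase similarities. Every specific choice here is an assumption made for illustration: the `Phrase` class, the `CONCEPT_SIM` table and its concept identifiers, the use of Jaccard overlap for stems, taking the maximum of the two scores, and the cosine-style normalization.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Phrase:
    """A phrase: one concept (possibly unknown) plus its word stems."""
    concept: Optional[str]   # hypothetical concept identifier; None if the knowledge source has none
    stems: FrozenSet[str]


# Hypothetical conceptual similarities, standing in for a knowledge source.
CONCEPT_SIM = {
    frozenset({"C_lung_cancer", "C_lung_neoplasm"}): 0.9,
}


def concept_similarity(a: Optional[str], b: Optional[str]) -> float:
    """Similarity of two concepts; 0.0 when either concept is unknown."""
    if a is None or b is None:
        return 0.0
    if a == b:
        return 1.0
    return CONCEPT_SIM.get(frozenset({a, b}), 0.0)


def stem_overlap(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard overlap of two stem sets (an assumed choice of measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def phrase_similarity(p: Phrase, q: Phrase) -> float:
    """Assumed combination rule: take the stronger of the two signals, so
    shared stems can compensate when the knowledge source is incomplete."""
    return max(concept_similarity(p.concept, q.concept),
               stem_overlap(p.stems, q.stems))


def document_similarity(d1: Dict[Phrase, float], d2: Dict[Phrase, float]) -> float:
    """Cosine-style similarity where every pair of phrases contributes
    in proportion to its phrase similarity (documents map phrase -> weight)."""
    def inner(x: Dict[Phrase, float], y: Dict[Phrase, float]) -> float:
        return sum(wx * wy * phrase_similarity(p, q)
                   for p, wx in x.items()
                   for q, wy in y.items())

    denom = sqrt(inner(d1, d1) * inner(d2, d2))
    return inner(d1, d2) / denom if denom else 0.0


if __name__ == "__main__":
    query = {Phrase("C_lung_cancer", frozenset({"lung", "cancer"})): 1.0}
    doc = {Phrase("C_lung_neoplasm", frozenset({"lung", "neoplasm"})): 1.0}
    print(document_similarity(query, doc))
```

In this sketch, two phrases with related concepts but few shared stems still score highly through the concept table, while a phrase whose concept is missing from the knowledge source can still match on its stems. A plain stem-based model corresponds to the special case where phrase similarity is 1 for identical terms and 0 otherwise.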
Keywords/Search Tags:Vector space model, Retrieval, Document