Font Size: a A A

Clustering Web documents: A phrase-based method for grouping search engine results

Posted on:2000-05-12Degree:Ph.DType:Dissertation
University:University of WashingtonCandidate:Zamir, Oren EliFull Text:PDF
GTID:1468390014964695Subject:Computer Science
Abstract/Summary:
This dissertation investigates whether the automatic grouping of similar documents (document clustering) is a feasible method of presenting the results of Web search engines. We identify several key requirements for document clustering of search engine results: clustering quality, concise and accurate cluster descriptions, and speed. In response, we present a novel clustering algorithm---Suffix Tree Clustering (STC) specifically designed for this task in several respects. First, STC groups documents based on shared phrases. Second, it allows overlapping clusters. Finally, STC is a fast, incremental, and linear-time (in the number of documents) algorithm. We have evaluated the clustering quality of STC and showed it superior to other commonly used algorithms on Web search engine results. We have also shown STC to be significantly faster than other linear-time algorithms when clustering search engine snippets. We have incorporated STC into the Grouper system, which is available at http:// vww.cs.washington.edu/research/clustering . Grouper is, to our knowledge, the first implementation of a post-retrieval document-clustering interface to a Web search engine.; Inspired by STC's success, we studied the use of phrases for clustering across a variety of clustering algorithms. Phrases contain more information than single words (information regarding proximity and order of words) and have the equally important advantage of having a higher descriptive power. We have shown that using phrases substantially improves the performance of all algorithms tested including hierarchical group-average, k-means, single pass, Fractionation and Buckshot.
Keywords/Search Tags:Clustering, Search engine, Documents, Results, STC, Web, Phrases, Algorithms
Related items