A comparative study of keyphrase-based query-specific clustering on WWW

Posted on: 2005-04-29
Degree: M.Sc
Type: Thesis
University: University of Alberta (Canada)
Candidate: Wang, Peng
GTID: 2458390008493299
Subject: Computer Science
Abstract/Summary:
Because HTML documents dominate the Web, this thesis proposes a keyphrase extraction algorithm (Extoken) that combines traditional lexical statistics with HTML formatting elements to extract a ranked list of keyphrases. We take the view that keyphrase extraction should serve as the foundation for further text-related tasks rather than as the end of processing. In particular, we used the extracted keyphrases to retrieve the original Web pages from the Web and studied the effect of using keyphrases for partitional query-specific document clustering in the Web domain. We compared the effectiveness of traditional ranked-list results with that of query-specific document clustering solutions, and performed a comparative study of how clustering effectiveness varies across different document representations: keyphrases, full documents, and document snippets.

Two online prototypes were developed in the course of the research: Phrastractor, an online Web document keyphrase extraction system powered by our HTML-element-aided keyphrase extraction algorithm Extoken, and Categorizer, a clustering meta search engine prototype built on top of query results returned by Google. (Abstract shortened by UMI.)
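The general idea of combining lexical statistics with HTML formatting cues can be illustrated with a minimal sketch. This is not the Extoken algorithm, whose details the abstract does not give. The tag weights, the stopword list, the restriction to one- and two-word phrases, and the function names `score_phrases` and `rank_keyphrases` are all assumptions made for illustration. Each candidate phrase accumulates a frequency count, and each occurrence is weighted by the most prominent HTML element enclosing it.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: an occurrence inside a prominent tag counts for more.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "b": 2.0, "strong": 2.0, "a": 1.5}
# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "is"}
# Text inside these elements is never visible content.
SKIPPED_TAGS = {"script", "style"}


class WeightedTextParser(HTMLParser):
    """Accumulates tag-weighted frequencies for 1- and 2-word phrases."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.scores = Counter()  # phrase -> weighted frequency

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.stack):
            return
        # Use the weight of the most prominent enclosing tag (default 1.0).
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack), default=1.0)
        words = re.findall(r"[a-z]+", data.lower())
        for n in (1, 2):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Skip phrases that start or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                self.scores[" ".join(gram)] += weight


def score_phrases(html):
    """Return a Counter mapping each candidate phrase to its weighted score."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return parser.scores


def rank_keyphrases(html, top_k=5):
    """Return the top_k phrases by weighted score, highest first."""
    return [phrase for phrase, _ in score_phrases(html).most_common(top_k)]
```

Calling `rank_keyphrases(page_html, top_k=10)` on a fetched page would return its ten highest-scoring candidate phrases under these assumed weights. Such a ranked list could then stand in for the full document text as input to a clustering step.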
Keywords/Search Tags:Clustering, Keyphrase, Document, HTML, Web, Query-specific