A comparative study of keyphrase-based query-specific clustering on WWW

Posted on:2005-04-29

Degree:M.Sc

Type:Thesis

University:University of Alberta (Canada)

Candidate:Wang, Peng

GTID:2458390008493299

Subject:Computer Science

Abstract/Summary:

Given the dominance of HTML documents on the Web, this thesis proposes a keyphrase extraction algorithm, Extoken, that combines traditional lexical statistics with HTML formatting elements to produce a ranked list of keyphrases. We take the view that keyphrase extraction should serve as the foundation for further text-related tasks rather than as the end of processing. In particular, we used the extracted keyphrases to retrieve the original Web pages from the Web, and we studied the effect of using keyphrases for partitional query-specific document clustering in the Web domain. We compared the effectiveness of traditional ranked-list results with that of query-specific document clustering solutions, and we performed a comparative study of how clustering effectiveness varies across three document representations: keyphrases, full documents, and document snippets. Two online prototypes were developed in the course of the research: Phrastractor, an online Web document keyphrase extraction system powered by Extoken, our keyphrase extraction algorithm aided by HTML elements; and Categorizer, a clustering meta-search engine prototype built on top of query results returned by Google. (Abstract shortened by UMI.)...
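
As an illustration only, the general idea of combining lexical statistics with HTML formatting cues can be sketched in Python. This is not Extoken itself: the abstract does not describe the actual scoring, so the tag weights, the stopword list, the n-gram limit, and the names `PhraseScorer` and `rank_keyphrases` are all assumptions made for this sketch. Each candidate phrase simply accumulates one count per occurrence, scaled by the heaviest enclosing tag.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: the source does not state which tags Extoken uses or how.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "b": 2.0, "strong": 2.0, "em": 2.0}
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are", "with"}
SKIP_TAGS = {"script", "style"}


class PhraseScorer(HTMLParser):
    """Accumulates tag-weighted frequencies for word n-grams in an HTML page."""

    def __init__(self, max_len=3):
        super().__init__()
        self.max_len = max_len
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # phrase -> weighted frequency

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIP_TAGS for t in self.open_tags):
            return
        # The heaviest enclosing tag decides the weight; plain text counts as 1.0.
        weight = max([TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags], default=1.0)
        words = re.findall(r"[a-z0-9]+", data.lower())
        for n in range(1, self.max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Discard candidates that start or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                self.scores[" ".join(gram)] += weight


def rank_keyphrases(html, top_k=5):
    """Return the top_k candidate phrases, highest weighted frequency first."""
    scorer = PhraseScorer()
    scorer.feed(html)
    scorer.close()
    return [phrase for phrase, _ in scorer.scores.most_common(top_k)]
```

Calling `rank_keyphrases(page_source)` on a page's HTML string returns a ranked list of candidate phrases. A fuller system would also need to normalize for phrase length and document length, which this sketch leaves out.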

Keywords/Search Tags:

Clustering, Keyphrase, Document, HTML, Web, Query-specific

Related items

1	Study On Text Clustering And Keyphrase Extraction Of Patent Document
2	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
3	Web Document Automatic Classification Based On Keywords
4	Study On Some Key Techniques Of Non-fully Structured XML Query Processing
5	Search term selection and document clustering for query suggestion
6	Research On The Keyphrase Extraction And Relevant Technology
7	Statistic-based Automatic Keyphrase Extraction And Summarization From Multi-document
8	The Research On Keyphrase Extraction Method Of Scientific Literature Based On Feature Representation
9	Chinese Keyphrases Extraction Technique
10	Research Of Query On The Probabilistic XML Document