Search Results Clustering Based On Web Structure

Posted on: 2011-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: S Wen
Full Text: PDF
GTID: 2178360308963592
Subject: Computer system architecture
Abstract/Summary:
The Internet has become one of the most important sources of information, and more and more people start their browsing with a search engine. The traditional way of displaying search results as a flat, one-dimensional list, however, no longer meets the need to find information efficiently. Three solutions have been proposed: query recommendation, personalized search, and search results clustering.

Search results clustering is still far from satisfactory, although it has been studied extensively and for a long time. Its main disadvantages are that processing time is too long, cluster labels are not readable enough, and clustering accuracy is too low. To avoid these drawbacks of traditional search results clustering, which is based on the similarity of result summaries, this paper proposes clustering search results according to the web structure of an intranet.

A search results clustering system based on web structure is designed and implemented in this paper. Offline, it crawls web pages, parses the web structure, and determines a semantic path for each web page. Online, once search results are returned, it merges the semantic paths of the returned pages. Because every web page is tagged with a semantic path in advance, the only online work is merging these paths, so processing time is cut dramatically. Based on observations, this paper proposes three rules for filtering out non-hierarchical links: (a) a topical page has no semantic child page; (b) links in the same link cluster point to web pages of the same type; (c) a link pointing to a semantic child page always occupies a more prominent position than a link that does not.

In the last section, the proposed method is compared with STC and Lingo, two well-known search results clustering methods proposed by O. Zamir and O. Etzioni, and by Stanislaw Osinski and Dawid Weiss, respectively. Because no similarity is computed between the summaries of search results, the method in this paper is much faster than Lingo. And because cluster labels are extracted from anchor text, cluster label readability is more satisfactory as well. Compared with pages on the World Wide Web, pages within an intranet are more homogeneous, and so are the information needs of people who use intranet search, which is why clustering search results according to web structure works better than clustering based on summary similarity.
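
The online merging step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes each returned page has already been tagged offline with a semantic path, represented here as a tuple of anchor-text labels from the site root to the page's parent. The function names (`merge_semantic_paths`, `count_pages`, `render`), the dictionary layout, and the example URLs and labels are all hypothetical.

```python
def merge_semantic_paths(results):
    """Merge precomputed semantic paths into a nested cluster tree.

    results: iterable of (url, path) pairs, where path is a tuple of
    anchor-text labels (an assumed representation of a semantic path).
    """
    root = {"pages": [], "children": {}}
    for url, path in results:
        node = root
        for label in path:
            # Reuse the cluster for this label if it exists, else create it.
            if label not in node["children"]:
                node["children"][label] = {"pages": [], "children": {}}
            node = node["children"][label]
        node["pages"].append(url)
    return root


def count_pages(node):
    """Number of result pages in a cluster, including all sub-clusters."""
    return len(node["pages"]) + sum(
        count_pages(child) for child in node["children"].values()
    )


def render(node, depth=0):
    """Return indented 'label (size)' lines, sorted by label at each level."""
    lines = []
    for label, child in sorted(node["children"].items()):
        lines.append("  " * depth + f"{label} ({count_pages(child)})")
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical search results with their offline-assigned semantic paths.
results = [
    ("http://intranet.example/a", ("Admissions", "Undergraduate")),
    ("http://intranet.example/b", ("Admissions", "Undergraduate")),
    ("http://intranet.example/c", ("Admissions", "Graduate")),
    ("http://intranet.example/d", ("Library",)),
]

tree = merge_semantic_paths(results)
for line in render(tree):
    print(line)
```

Each result contributes one walk down the tree, so the online cost in this sketch grows with the number of results times the path depth, and no pairwise summary comparison is needed. The cluster labels are simply the path labels, which matches the abstract's point that labels come from anchor text.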
Keywords/Search Tags: Search Results Clustering, web structure, data clustering, search engine