
Search Results Clustering Method Based On Maximal Frequent Itemsets

Posted on: 2010-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: C Su
Full Text: PDF
GTID: 2178360332957853
Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth of information on the Internet, helping users locate the information they need has become an urgent and important problem. Clustering search results online addresses this problem by presenting the results to users in groups. However, because search results must be clustered in real time and the resulting cluster labels must be readable, traditional clustering algorithms cannot meet the need. In addition, most previous research is based on web page snippets, and its accuracy needs to be improved. In this thesis, we research and design a new clustering algorithm based on the full text of web pages, on the platform of an open-domain search engine. After investigating frequent itemsets and their use in clustering algorithms, we propose a search results clustering algorithm based on maximal frequent itemsets (Maximal Frequent Itemsets Clustering, MFIC). By using a dynamic minimum support and selecting maximal frequent itemsets as the foundation of clustering, MFIC breaks through the bottleneck that limits the use of frequent itemsets in online clustering.
Cluster labels are also generated from frequent itemsets. This thesis mainly includes the following contents:
(1) Taking the characteristics of search results into account, preprocess the web pages, apply a dynamic minimum support when mining frequent itemsets, and mine only the maximal frequent itemsets instead of all frequent itemsets. This improves the usability of the frequent itemsets and satisfies the real-time requirement.
(2) Design and implement an online web page clustering system, which computes similarity and clusters pages based on the relation between the sets of pages covered by the frequent itemsets and their word sets.
(3) Design a label generation algorithm that combines frequent itemsets with word order to extract phrase labels, improving on the label generation of existing clustering algorithms based on frequent itemsets.
(4) Validate the advantage of the proposed online clustering method through experimental comparison with other clustering algorithms.
Finally, the system has been successfully used in an intelligent web information retrieval platform. Experimental results show that the proposed method meets the requirements of online clustering, especially in terms of running time and precision.
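The abstract does not give MFIC's actual formulas, so the following is only a minimal Python sketch of the general idea described above: pick a support threshold that depends on the size of the result set, mine frequent itemsets, keep only the maximal ones, and treat the pages covered by each maximal itemset as a cluster whose label follows the word order in a covered page. The function names, the `ratio` and `floor` parameters, the threshold rule, and the toy data are all assumptions made for illustration, not the thesis's method.

```python
from itertools import combinations


def dynamic_min_support(num_pages, ratio=0.3, floor=2):
    """Hypothetical rule: the threshold grows with the number of results."""
    return max(floor, int(ratio * num_pages))


def frequent_itemsets(pages, min_support):
    """Level-wise (Apriori-style) search.

    Returns {itemset: set of ids of the pages that contain every word in it}.
    """
    cover = {}
    for pid, words in enumerate(pages):
        for w in words:
            cover.setdefault(frozenset([w]), set()).add(pid)
    level = {s: c for s, c in cover.items() if len(c) >= min_support}
    result = {}
    while level:
        result.update(level)
        nxt = {}
        for a, b in combinations(level, 2):
            cand = a | b
            # Join two k-itemsets that differ in exactly one word.
            if len(cand) == len(a) + 1 and cand not in nxt:
                pages_covered = level[a] & level[b]
                if len(pages_covered) >= min_support:
                    nxt[cand] = pages_covered
        level = nxt
    return result


def maximal_itemsets(freq):
    """Keep itemsets that have no frequent proper superset."""
    return {s: c for s, c in freq.items() if not any(s < t for t in freq)}


def cluster_results(pages, ratio=0.3):
    """Map a phrase-like label to the sorted ids of the pages it covers.

    Clusters may overlap, and pages covered by no maximal itemset are omitted.
    """
    min_support = dynamic_min_support(len(pages), ratio=ratio)
    maximal = maximal_itemsets(frequent_itemsets(pages, min_support))
    clusters = {}
    for itemset, covered in maximal.items():
        # Order the label words as they first appear in one covered page.
        sample = pages[min(covered)]
        label = " ".join(sorted(itemset, key=sample.index))
        clusters[label] = sorted(covered)
    return clusters


# Toy, made-up "search results" for the query "apple", already tokenized.
pages = [
    ["apple", "pie", "recipe"],
    ["apple", "pie", "baking"],
    ["apple", "iphone", "review"],
    ["apple", "iphone", "price"],
]
print(cluster_results(pages, ratio=0.5))
```

Restricting the output to maximal itemsets keeps the number of candidate clusters small, which is one plausible reading of how the approach stays usable in real time. A real system would also need the preprocessing, similarity computation, and cluster merging steps that the abstract mentions but does not specify.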
Keywords/Search Tags: search engine, text clustering, frequent itemset