
Research On Cluster Algorithm For Web Object

Posted on: 2011-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Sheng
Full Text: PDF
GTID: 2178360302474599
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet and Web 2.0 sites, clustering objects in web documents has become a hot topic in the web information retrieval research community. High-quality web IR generally requires fine-grained clustering of the objects in documents. However, most existing clustering algorithms operate at the level of sentence structure or textual topic and cannot handle semi-structured web data. Because they do not consider token information when identifying finer-grained objects, they often produce coarse-grained clustering results.

To address these problems, we propose a novel fine-grained clustering algorithm that captures the probabilistic hierarchy property between tokens. First, it constructs a directed acyclic graph of information transmission from token frequency sequences, which reflect the distribution of token information, and then uses a trigger-pair model to mine the associations of hidden attributes as the signatures of objects in unstructured data. This groups the tokens that help identify objects and filters out noise. It then assigns appropriate weights to tokens, which makes the feature vectors more representative for identifying objects. Second, we propose a self-tuning method for merging records that are highly similar to each other. This effectively reduces the impact of noise by giving duplicate records a second chance to choose their final clusters.

Our experiments on real datasets show that the proposed clustering algorithm filters noise, assigns proper weights to features, and considerably outperforms conventional algorithms, with an average improvement of 21.3% in F-measure. The algorithm can be used for multidisciplinary web object clustering.
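
To illustrate the kind of association a trigger-pair model looks for, the following minimal Python sketch scores token pairs by pointwise mutual information (PMI) over records. The abstract does not specify the thesis's actual scoring formula, so the use of PMI, the function name trigger_pair_scores, and the toy records are all assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def trigger_pair_scores(records):
    """Score token pairs by pointwise mutual information (PMI).

    Assumption: PMI stands in for the thesis's trigger-pair measure,
    which the abstract does not define. Each record is a list of tokens;
    a pair co-occurs when both tokens appear in the same record.
    """
    n = len(records)
    token_count = Counter()
    pair_count = Counter()
    for record in records:
        # Deduplicate and sort so each pair has one canonical ordering.
        tokens = sorted(set(record))
        token_count.update(tokens)
        pair_count.update(combinations(tokens, 2))

    scores = {}
    for (a, b), n_ab in pair_count.items():
        p_ab = n_ab / n
        p_a = token_count[a] / n
        p_b = token_count[b] / n
        # PMI > 0: the tokens co-occur more often than independence predicts.
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy records, not data from the thesis.
records = [
    ["price", "usd", "title"],
    ["price", "usd"],
    ["title", "author"],
    ["author", "isbn"],
]
scores = trigger_pair_scores(records)
```

Pairs with a high score, such as ("price", "usd") in the toy records, are candidates for tokens that jointly signal the same kind of object, while pairs scoring near zero carry no evidence of association.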
Keywords/Search Tags:information distribution, trigger pair, similar histogram, cluster