Research On The Key Problems Of Web Community Discovery Based On Multiple Features

Posted on: 2008-09-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Gao
GTID: 1118360215999012
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the WWW has become a giant, distributed, global information resource that provides information about news, finance, advertising, business, culture, education, and more. How to obtain information from the Web, or discover the knowledge hidden in it, quickly and accurately is an urgent demand.

A community is a collection of web pages that are highly related, interconnected, and share the same topic. Extracting knowledge from communities is a quick and efficient way to discover knowledge on the Web. Community discovery aims to find both hidden and defined communities in the distributed, disordered environment of the Internet.

In this dissertation, we study several key problems of web community discovery in depth: 1) pre-processing of web pages; 2) topic community discovery based on multiple types of features; 3) a model of information retrieval based on topic communities. The main work of this dissertation is:

● For pre-processing of web pages, we propose a new algorithm to extract the contents of web pages. In this algorithm, a web page is divided into blocks using a new objective function established according to the degree of coupling between blocks and the degree of coherence within blocks. "Topic" or "topic-relevant" blocks are then extracted using the blocks' contents and structure information. Merging the contents of these blocks yields the main content of the web page. Experiments on web pages from three sites indicate the algorithm's effectiveness for extracting the contents of web pages of any type.

● Clustering ensemble is the main part of topic community discovery based on multiple types of features. The objective function based on mutual information introduced by Strehl & Ghosh is extended, and a new ensemble algorithm is proposed to combine "soft" partitions. Experiments on four real-world data sets indicate that our algorithm provides solutions of improved quality.

● The quality of a clustering ensemble depends not only on the "consensus function" but also on the distribution of the partitions participating in the ensemble. The greater the diversity of these partitions, the higher the quality of the ensemble. Considering the influence of diversity, we propose a diversity-based weighted clustering ensemble algorithm to combine "soft" partitions. When the diversity distribution is uneven or the expectation of the diversity distribution is low, this algorithm can improve the quality of the ensemble.

● For web community discovery, a basic clustering algorithm must be applied to different feature sets of web pages to produce the partitions before the clustering ensemble step. Here, we adopt the information bottleneck algorithm and extend it to the multi-view setting. By combining the multi-view information bottleneck with the multi-view representation of web pages and the ensemble algorithm based on mutual information, we propose a multi-view web community discovery algorithm. Experiments indicate the efficiency of the new algorithm.

● A model of information retrieval based on topic communities is proposed. This model defines a mediated layer between the user and a general search engine. Users access the pre-defined community model through the mediated layer and identify the needed topic. The mediated layer refines the query with the needed topic and generates a mediated query that helps search for information on the Web through a general search engine.
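
As an illustration of the mutual-information objective mentioned in the clustering ensemble items above, the following minimal Python sketch scores a candidate consensus partition by its average normalized mutual information (NMI) with the partitions in the ensemble, in the spirit of Strehl & Ghosh. It handles only "hard" labelings; the extension to "soft" partitions and the diversity-based weighting described in the abstract are not reproduced here. The function names and the optional `weights` parameter are assumptions made for this example, not taken from the dissertation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a hard labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two hard labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * math.log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(a, b):
    """NMI normalized by sqrt(H(a) * H(b)); 0.0 if either labeling is trivial."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


def average_nmi(candidate, partitions, weights=None):
    """Weighted average NMI between a candidate and every ensemble partition.

    `weights` is a hypothetical hook for per-partition weights; uniform
    weights are used when it is omitted.
    """
    if weights is None:
        weights = [1.0] * len(partitions)
    total = sum(weights)
    return sum(w * nmi(candidate, p) for w, p in zip(weights, partitions)) / total


# Toy ensemble: three partitions of six items (made-up data).
partitions = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],  # same grouping as the first, with labels permuted
    [0, 0, 0, 1, 1, 1],
]
candidate_a = [0, 0, 1, 1, 2, 2]
candidate_b = [0, 1, 2, 0, 1, 2]

score_a = average_nmi(candidate_a, partitions)
score_b = average_nmi(candidate_b, partitions)
print(score_a > score_b)
```

Under this objective, a consensus algorithm would search for the labeling with the highest score; here `candidate_a` agrees with the ensemble more than `candidate_b` does, so it scores higher.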
Keywords/Search Tags: community discovery, content extraction, clustering ensemble, multi-view learning, information retrieval