Extracting Local Web Communities Using Lexical Similarity

Posted on: 2011-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: W Xu
Full Text: PDF
GTID: 2178330332961409
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of Web information, retrieving latent and useful information from the enormous volume of Web data, and using that information efficiently, has become an increasingly important and challenging research problem. Web community discovery is therefore valuable both in practice and in academic study. The task of Web community extraction is to find all the cohesive Web pages related to a specific query. Applied in search engines, a Web community extraction algorithm can help improve the performance and precision of Web information retrieval and support the clustering of Web information.

Based on an analysis of the current Web and the characteristics of its data, of Web information retrieval models, and of search engine architecture, this thesis examines the classical Web community discovery algorithms, most of which focus on link analysis without considering the textual properties of Web pages. It proposes an improved algorithm based on Flake's method, which uses the maximum flow algorithm. The improved algorithm takes into account that edges differ in importance and assigns each edge a well-designed capacity derived from the lexical similarity of the Web pages it connects. Given a specific query, the approach also lends itself to a new and efficient ranking scheme for the members of the extracted community, which sharpens the differences between members according to their content similarity to the seeds. We also propose an aggregation algorithm that constructs the vicinity graph at the granularity of sites rather than pages, according to the user's needs. The experimental results indicate that our approach handles a variety of data sets efficiently, avoiding topic drift and increasing both the size and the quality of the extracted community.
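
The idea of weighting edges by lexical similarity before running maximum flow can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the bag-of-words cosine measure, the constant sink capacity `alpha`, the virtual node names, and the toy pages are all assumptions made for illustration, and the graph construction is a simplified variant of a Flake-style max-flow formulation. Edge capacities come from the similarity of the linked pages, the community is the source side of a minimum cut, and members are ranked by their mean similarity to the seeds.

```python
import math
from collections import Counter, deque

def cosine(text_a, text_b):
    """Cosine similarity of two bag-of-words term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def source_side_of_min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns nodes still reachable from s (excluding s)."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}  # residual capacities
    for u, nbrs in cap.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0.0)  # reverse arcs start at 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent, queue = {s: None}, deque([s])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent) - {s}  # no augmenting path left: min cut found
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)  # bottleneck capacity
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push

def extract_community(pages, links, seeds, alpha=0.3, big=1e9):
    """pages: id -> text; links: (u, v) pairs, treated as undirected."""
    S, T = "_source", "_sink"  # hypothetical names for the virtual nodes
    cap = {p: {} for p in pages}
    cap[S], cap[T] = {}, {}
    for u, v in links:
        # Assumption: capacity equals the lexical similarity of the two pages.
        cap[u][v] = cap[v][u] = cosine(pages[u], pages[v])
    for p in pages:
        if p in seeds:
            cap[S][p] = big    # seeds are tied firmly to the source
        else:
            cap[p][T] = alpha  # assumed constant pull toward the sink
    members = source_side_of_min_cut(cap, S, T)
    def score(p):
        return sum(cosine(pages[p], pages[q]) for q in seeds) / len(seeds)
    return sorted(members, key=score, reverse=True)

# Toy data (hypothetical): three pages on one topic, two on another.
pages = {
    "a": "max flow graph cut algorithm",
    "b": "graph cut max flow community",
    "c": "graph algorithm flow network",
    "d": "cooking recipes pasta algorithm",
    "e": "pasta recipes dinner",
}
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
community = extract_community(pages, links, seeds={"a"})
print(community)
```

In this toy graph the link between `c` and `d` exists but carries little capacity because the two pages share only one term, so the minimum cut separates the off-topic pages from the seed's neighbourhood. With uniform capacities the cut would depend on link structure alone, which is the situation the abstract describes as prone to topic drift.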
Keywords/Search Tags: Information Retrieval, Community Extraction, Maximum Flow Algorithm, Lexical Similarity