Research On Automatic Seed Set Expansion Algorithm In Anti Search Engine Spam

Posted on: 2010-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: B Han
Full Text: PDF
GTID: 2178360302960831
Subject: Computer application technology
Abstract/Summary:
With the rapid development of search engines and the World Wide Web, the growing body of information on the web can be shared and accessed by everyone; at the same time, this growth has led to abundant search engine spam. Search engine spam refers to the use of various methods to make certain pages rank higher than they deserve. It causes many problems, e.g., increasing the operational cost of search service providers and decreasing users' satisfaction with the search results.
Conventional anti-spamming algorithms based on heuristic rules cannot demote spam universally, and they are easily defeated by spammers. Anti-spamming algorithms based on trust propagation are robust to spammers' attacks and can demote spam universally. However, the efficiency of these algorithms relies heavily on the seed nodes: the quantity and quality of the seed set limit how well the algorithms combat spam. Traditionally, the seed set is constructed through manual evaluation; this approach not only restricts the seed set to a small size but also cannot guarantee the quality of the selected seeds. Thus, how to select or expand the seed set to meet the needs of trust propagation based anti-spamming algorithms has become a challenging research issue.
This paper proposes the ASE (Automatic Seed Expansion) algorithm. It introduces the definition of the reputable support degree between nodes and incorporates domain knowledge together with joint recommendation topologies to expand a small seed set into a large, reputable, and less domain-biased seed set, thereby meeting the needs of trust propagation based algorithms in both seed set quantity and quality. The paper further analyzes how to select the initial seed set: it gives two heuristic methods (combineSelection and thresholdSelection) for different scenarios and analyzes their efficiency, advantages, and disadvantages.
Experiments on the WEBSPAM-2007 data set show that applying ASE to TrustRank improves the algorithm's efficiency over the original TrustRank by 27.2% in reputable node promotion and by 49.5% in spam node demotion, indicating the effectiveness of ASE in improving anti-spamming algorithms.
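The dependence of trust propagation on the seed set can be illustrated with a small sketch. The code below is not the ASE algorithm, whose details the abstract does not give; it is a minimal seed-biased PageRank of the kind commonly used to describe TrustRank. The function name `trust_rank`, the toy graph, the damping factor of 0.85, and the iteration count are all assumptions made for illustration.

```python
def trust_rank(out_links, seeds, damping=0.85, iterations=50):
    """Propagate trust from a seed set over a directed link graph.

    out_links maps each node to the list of nodes it links to.
    Returns a dict of trust scores whose values sum to 1.
    (Hypothetical helper, assumed for illustration only.)
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    seeds = sorted(set(seeds) & nodes)
    if not seeds:
        raise ValueError("seed set must contain at least one graph node")

    # Teleport distribution: uniform over the seeds, zero elsewhere.
    teleport = {n: 0.0 for n in nodes}
    for s in seeds:
        teleport[s] = 1.0 / len(seeds)

    trust = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = out_links.get(n, [])
            if targets:
                # Split this node's trust evenly among its out-links.
                share = damping * trust[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: hand its trust back to the seeds.
                for s in seeds:
                    nxt[s] += damping * trust[n] / len(seeds)
        trust = nxt
    return trust


# Toy graph (assumed): a, b, c are reputable pages; s1, s2 form a spam
# pair that links to a reputable page but receives no links from one.
graph = {
    "a": ["b"],
    "b": ["c"],
    "c": ["a"],
    "s1": ["s2", "a"],
    "s2": ["s1"],
}
scores = trust_rank(graph, seeds=["a"])
```

In this toy graph the spam pair receives no trust because no path leads from the seed to it. The scores depend entirely on which nodes are chosen as seeds, which is why the size and quality of the seed set matter for this family of algorithms.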
Keywords/Search Tags: Link Analysis, PageRank, TrustRank, Search Engine Spam