
Research On The Key Methods For Web Spam Detection

Posted on: 2017-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: S Wei
Full Text: PDF
GTID: 2308330485988690
Subject: Computer Science and Technology
Abstract/Summary:
Web spam refers to Web pages whose authors (spammers) use tricks to rank higher than the pages deserve. Web spam seriously damages the interests of search engines, legitimate websites, and users, so Web spam detection has become a difficult but important task.

Ranking algorithms are effective against link-based spamming. However, good-to-bad links on the Web can degrade the detection performance of ranking algorithms, and most ranking algorithms do not take content features into consideration. This thesis therefore improves on the Anti-TrustRank and WATR (Weighted Anti-TrustRank) algorithms and proposes TLDR (Distrust Rank based on Topic and Link integration), a distrust ranking algorithm that combines topic similarity with link weight to adjust distrust propagation. In TLDR, the topic weight is computed with a Latent Dirichlet Allocation model and the link weight is computed from the link structure. The experimental results show that TLDR assigns more reasonable distrust values to Web pages and clearly outperforms Anti-TrustRank and WATR.

To address both content-based and link-based spamming, this thesis quantifies Web quality along three dimensions: Web source quality, Web content quality, and Web application quality. In addition, semantic features are extracted from two aspects: degree of harm and topic characteristics. The quality and semantic features are then combined with content and link features to build a discriminative feature set. Classification algorithms are often used for Web spam detection, but their performance is easily affected by unbalanced data, whereas outlier mining algorithms are suited to unbalanced situations. This thesis therefore employs the Entropy-based Outlier Mining (EOM) algorithm and designs a cascading detection framework with three stages: content, link, and semantic characteristic detection. The results of a series of comparative experiments show that the quality and semantic features effectively improve detection, and that the EOM cascading detection framework performs well and has advantages over classification algorithms when the data are unbalanced.
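
The abstract does not give the TLDR update rule, so the following is only a minimal sketch of the general idea it describes: distrust starts at known spam seeds and flows backward along links (as in Anti-TrustRank), with each link's share scaled by a weight. Here the weight is assumed to be a precomputed combination of topic similarity and link weight; the function name `propagate_distrust`, the `weights` dictionary, the damping factor `alpha`, and the toy graph are all hypothetical and are not taken from the thesis.

```python
def propagate_distrust(links, weights, spam_seeds, alpha=0.85, iterations=50):
    """Weighted backward distrust propagation (illustrative sketch only).

    links:      dict mapping a page to the list of pages it links to
    weights:    dict mapping (src, dst) to a non-negative edge weight
                (assumed here to combine topic similarity and link weight)
    spam_seeds: set of pages known to be spam
    """
    if not spam_seeds:
        raise ValueError("at least one spam seed is required")

    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Seed vector: distrust mass is spread evenly over the spam seeds.
    seed = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages}

    # Total incoming weight per target, used to normalise each link's share.
    incoming = {p: 0.0 for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst] += weights.get((src, dst), 1.0)

    scores = dict(seed)
    for _ in range(iterations):
        nxt = {p: (1 - alpha) * seed[p] for p in pages}
        for src, targets in links.items():
            for dst in targets:
                if incoming[dst] > 0:
                    # A page inherits distrust from the pages it links to.
                    share = weights.get((src, dst), 1.0) / incoming[dst]
                    nxt[src] += alpha * scores[dst] * share
        scores = nxt
    return scores


# Toy graph: "a" and "b" both link to a spam page, "c" links to a good page.
links = {"a": ["spam"], "b": ["spam"], "c": ["good"]}
# Hypothetical weights: "a" is topically close to the spam page, "b" is not.
weights = {("a", "spam"): 0.9, ("b", "spam"): 0.1, ("c", "good"): 1.0}
scores = propagate_distrust(links, weights, spam_seeds={"spam"})
```

In this toy graph, the page whose link to the spam seed carries the larger weight receives more distrust, while a page that only links to a non-spam page receives none. A low weight on a topically unrelated link is one plausible way to limit the effect of good-to-bad links mentioned above.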
Keywords/Search Tags: Web Spam Detection, Distrust Rank, Outlier Mining, Cascading Detection, Multiple Features